Different running time on same GPU with same code

A project which is called from MATLAB (via mex) is running about 4 times slower on a machine with exactly the same setup as the test machines.

The configuration is Win 7 64, Visual Studio 2010 x64, CUDA 5.5 with latest drivers, MATLAB 2012b and Tesla K20c (TCC).

This code uses cuBLAS and cuSPARSE, as well as custom kernels. On a colleague's machine, which has almost exactly the same configuration as my test machine, the running times are much slower.

All the compile flags are the same, and ECC is off on both machines. Both machines have a dedicated GPU for video out, and a Tesla GPU for calculations.

I did notice that the offending machine does have slightly lower numbers for the typical CUDA-Z tests and the CUDA SDK samples, but only by about 15%.

Before, when I was just using cuBLAS, the times were only slightly slower, but now they are off by a factor of 4. The only change made was the use of cuSPARSE for some matrix-vector multiplies, which sped up the routine quite a bit (the matrices are about 8% nnz).

I tested the same project code on a laptop 680m and that ran faster than the offending K20c, so I am wondering what other factors may be contributing to this issue. Also tested on a machine with a K40c and those times were slightly faster than my K20c.

The results are the same for all machines and I have run CUDA-MEMCHECK to verify there are no leaks or other errors.

What else should I look at in order to narrow down this issue?

Thanks…

Not knowing much about either the app or the systems, this calls for much speculation. The easiest way to pinpoint the source of the slowdown may be to run with the profiler on both systems, and compare the wealth of data it generates. For example, are kernel run times the same, or different? Are data transfers the same speed, or different? Is host overhead the same, or different?

I am not sure whether the runtime differences above are referring to app level or to kernel level. If they are at app level, it would make sense to eliminate host-side differences as the culprit. For example, is the slow host running expensive background tasks such as FAH or BOINC clients, or maybe swapping due to lack of memory? Have you tried eliminating existing differences between the machines, and power cycling to ensure approximately identical software states? Are the GPUs plugged into the correct PCIe slots (x16), and into the same slot on each system? Are SBIOS version and SBIOS settings relevant to PCIe identical?

Are the GPUs supplied with adequate power (check connectors) and sufficiently cooled? Lack of power or cooling could force clock throttling. Does nvidia-smi show the same power states etc. on both systems while the app is running? Do the GPUs have the same VBIOS versions? Are both systems running the same driver package and CUDA versions?

Is the same CUDA application binary run on both systems, or is the app compiled locally on each system? If the latter, double check the builds to make sure they are identical across the systems, and compare the resulting binaries. If there are multiple GPUs in each system, make sure the app runs on the intended one. Does the build create architecture-specific machine code, or does the app rely on JIT compilation? You would want to avoid JITing as it is system-dependent overhead.
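As a rough illustration of comparing the resulting binaries, here is a minimal host-side C++ sketch that checks whether two files are byte-for-byte identical. The function name `files_identical` is made up for this example, and the paths you pass in would be your own copies of the mex binary from each machine. Note that a mismatch does not by itself prove a meaningful build difference, since linkers may embed timestamps in locally built binaries.

```cpp
#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper (not part of any toolkit): returns true only when
// both files can be opened and contain exactly the same bytes.
bool files_identical(const std::string& path_a, const std::string& path_b) {
    std::ifstream a(path_a, std::ios::binary);
    std::ifstream b(path_b, std::ios::binary);
    if (!a || !b) {
        return false;  // a missing or unreadable file counts as "not identical"
    }
    std::istreambuf_iterator<char> it_a(a), it_b(b), end;
    // The four-iterator overload of std::equal (C++14) also returns false
    // when one file is a strict prefix of the other.
    return std::equal(it_a, end, it_b, end);
}
```

If the files differ, diffing the build logs or disassembly from both machines is probably a more informative next step than the raw bytes.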

It is unlikely that there is anything specifically tied to the GPU hardware that causes this, but you could look into that by physically swapping the GPUs between the two systems.

Yes, profiling in detail would help, but I do not have physical access to that machine (it is in a different city). Profiling a mex interface called from MATLAB is slightly more complex than just using nvprof or nvvp on an executable, but that needs to be done next.

We have tried running the app as compiled on my machine, and also re-compiled on that machine and the result was the same. The app does not rely on JIT compilation.

I did not think of the power issue, so will look at that next.

The driver versions on the offending machine are the same, though the CPU/Motherboard/RAM/SSD setups are different. This executable does not spend much time copying memory across the bus, rather most of the running time is spent on multiple cuBLAS/cuSPARSE calls.

Those are all good areas to investigate, thanks for the help.

I was able to profile the mex through nvvp using a great guide from the guys at Orange Owl Solutions:

http://www.orangeowlsolutions.com/archives/570

which worked like a charm, and enables me to compare the outputs from the other machine running the same code.

I just wanted to share that link, because it is useful and that site has all kinds of helpful information, especially when it comes to calling CUDA mex files from MATLAB.

Also the profiling for the app looks good, and I can see concurrent launches (via streams) of both cuBLAS Sgemv() and cuSPARSE Scsrmv() calls.

Just a thought: is there more than one GPU installed in the slow system? Perhaps the code is not running on the K20c at all but on another, slower, CUDA GPU.

@cudaaduc, are either you or your workstation traveling at relativistic speeds?

Facepalm. How did I manage to overlook the possible applicability of the twin paradox and thus time dilation as a potential root cause? :-)

Knowing you, it was because you were busy writing a gate-level simulation of a K20 to look for any configurations where constructive interference clock latchups cause a 4:1 effective execution speed.

That was the first thing we checked, but it is indeed running on the Tesla.

LOL, not that I am aware of…

If you installed Nsight Visual Studio Edition and have been using the CUDA debugger please make sure that you have not configured the applications to be attachable.

Specifically, check that the environment variable NSIGHT_CUDA_DEBUGGER is not set to 1. This will cause additional code to be loaded into the application which will slow down the application.

Attach to CUDA process
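To make this kind of misconfiguration easier to spot on a remote machine, one option is a small host-side check that the application (or mex entry point) runs at startup. The sketch below only uses the standard `std::getenv`; the helper names `is_attach_value` and `nsight_attach_enabled` are invented for this example, and it assumes, as described above, that the value "1" is what enables the attachable mode.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: true when the given environment value is exactly "1".
bool is_attach_value(const char* value) {
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// Hypothetical helper: true when NSIGHT_CUDA_DEBUGGER is set to "1" in the
// environment of the current process.
bool nsight_attach_enabled() {
    return is_attach_value(std::getenv("NSIGHT_CUDA_DEBUGGER"));
}
```

A caller could print a warning when `nsight_attach_enabled()` returns true, so that unexpectedly slow timings come with a hint about the cause.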

The Nsight Visual Studio Edition CUDA Profiler should not have the same problems that NVVP has in profiling a MEX application. You will want to use “Profile CUDA Process Tree”.

That was it! I just had my colleague un-set that variable and now our running times are the same.

This is especially good because they are preparing slides for GTC, and have been using the poor running times on their graphs, so now they will be able to update in time.

Thanks!