A project that is called from MATLAB (via mex) runs about 4 times slower on one machine than on the test machines, even though the setup is nominally the same.
The configuration is Windows 7 64-bit, Visual Studio 2010 x64, CUDA 5.5 with the latest drivers, MATLAB R2012b and a Tesla K20c (TCC mode).
The code uses cuBLAS and cuSPARSE, as well as custom kernels. On a colleague's machine, which has almost exactly the same configuration as my test machine, the running times are much longer.
All the compile flags are the same, and ECC is off on both machines. Both machines have a dedicated GPU for video out and a Tesla GPU for calculations.
I did notice that the offending machine does have slightly lower numbers for the typical CUDA-Z tests and the CUDA SDK samples, but only by about 15%.
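To rule out configuration drift, one thing I can do is dump what the runtime reports on each machine and diff the output. A minimal sketch, assuming the `cudaDeviceProp` field names from the CUDA 5.x runtime API (newer toolkits may differ); the program itself is only an illustration and is not part of the project:

```cuda
// Illustrative helper (not project code): prints the device properties that
// most often differ between two supposedly identical machines.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);
    cudaRuntimeGetVersion(&runtimeVersion);
    printf("driver API version %d, runtime version %d\n",
           driverVersion, runtimeVersion);

    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp p;
        if (cudaGetDeviceProperties(&p, dev) != cudaSuccess) continue;
        printf("device %d: %s (sm_%d%d)\n", dev, p.name, p.major, p.minor);
        printf("  SMs              : %d\n", p.multiProcessorCount);
        printf("  core clock       : %d kHz\n", p.clockRate);
        printf("  memory clock     : %d kHz\n", p.memoryClockRate);
        printf("  memory bus width : %d bits\n", p.memoryBusWidth);
        printf("  ECC enabled      : %d\n", p.ECCEnabled);
        printf("  TCC driver       : %d\n", p.tccDriver);
        printf("  PCI bus/device   : %d/%d\n", p.pciBusID, p.pciDeviceID);
    }
    return 0;
}
```

The reported clocks are the values the runtime exposes, not necessarily the clocks the card sustains under load, so a match here would not by itself rule out throttling.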
Before, when I was using only cuBLAS, the times were only slightly slower, but now they are off by a factor of 4. The only change made was the use of cuSPARSE for some matrix-vector multiplies, which sped up the routine quite a bit (the matrices are about 8% nnz).
I tested the same project code on a laptop with a GTX 680M and it ran faster than the offending K20c, so I am wondering what other factors may be contributing to this issue. I also tested on a machine with a K40c, and those times were slightly faster than on my K20c.
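Since raw GPU capability does not seem to explain the gap, a check I could run on both machines is to time a host-to-device copy and a kernel separately with CUDA events, to see whether the difference sits on the PCIe side or the compute side. A minimal sketch; the `scale` kernel, the `CHECK` macro and the buffer size are placeholders of my own, not code from the project:

```cuda
// Illustrative timing harness (not project code): separates transfer time
// from kernel time using CUDA events.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t e_ = (call);                                     \
        if (e_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s failed: %s\n", #call,                \
                    cudaGetErrorString(e_));                         \
            exit(1);                                                 \
        }                                                            \
    } while (0)

// Placeholder kernel standing in for the real workload.
__global__ void scale(float* x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 22;                  // 4M floats = 16 MiB
    const size_t bytes = n * sizeof(float);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float* h = 0;
    float* d = 0;
    CHECK(cudaMallocHost((void**)&h, bytes));   // pinned host buffer
    CHECK(cudaMalloc((void**)&d, bytes));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    // Warm-up so context creation and first-launch cost are not timed.
    CHECK(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice));
    scale<<<blocks, threads>>>(d, n, 2.0f);
    CHECK(cudaDeviceSynchronize());

    cudaEvent_t t0, t1, t2;
    CHECK(cudaEventCreate(&t0));
    CHECK(cudaEventCreate(&t1));
    CHECK(cudaEventCreate(&t2));

    CHECK(cudaEventRecord(t0, 0));
    CHECK(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(t1, 0));
    scale<<<blocks, threads>>>(d, n, 2.0f);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(t2, 0));
    CHECK(cudaEventSynchronize(t2));

    float copyMs = 0.0f, kernelMs = 0.0f;
    CHECK(cudaEventElapsedTime(&copyMs, t0, t1));
    CHECK(cudaEventElapsedTime(&kernelMs, t1, t2));
    printf("H2D copy: %.3f ms (%.2f GB/s)\n", copyMs, bytes / (copyMs * 1.0e6));
    printf("kernel  : %.3f ms\n", kernelMs);

    CHECK(cudaEventDestroy(t0));
    CHECK(cudaEventDestroy(t1));
    CHECK(cudaEventDestroy(t2));
    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return 0;
}
```

Running this as a standalone executable, outside MATLAB, would also show whether the slowdown depends on the mex host process at all.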
The computed results are the same on all machines, and I have run CUDA-MEMCHECK to verify that there are no leaks or other errors.
What else should I look at in order to narrow down this issue?