I compiled my CUDA code with ‘compute_10,sm_10’ and with ‘compute_35,sm_35’, and the 1.0 version is 30% faster than the 3.5 version. Is anything wrong? My card has compute capability 3.5.
Hard to give advice without the problematic code :)
I also tend to compile for CC 1.0 or 1.1 unless I specifically require language or device features from later CUDA versions. You also get the benefit of broader hardware compatibility (older devices and older drivers will run your code).
One reason compute 1.x targets might be faster is that up to 124 registers are available per thread for compute 1.x targets, and the run-time (JIT) conversion from PTX 1.x code to the target hardware (which may be limited to 64 registers/thread) seems to do a really good job of minimizing register spills.
Also, compiling for compute 1.x uses the Open64-based compiler, while newer targets use the LLVM-based compiler. The two may exhibit distinctly different performance characteristics due to different optimization strategies.
If the code uses double-precision math, maybe the compute 1.0 build is demoting it to single precision, since devices below compute capability 1.3 have no double-precision support.
The code only uses float and short; it does not use DP. The code does use quite a few registers. Thanks, Christian, for the explanation. But I still think the newer compiler should do at least as good a job as the older one.
Gogar may still be right. Double- and triple-check that you have no doubles, especially in constants. It is very easy to accidentally write code like “a += 3.14159” or “a += 2.0*b”. Even if a and b are both floats, those are still double-precision computations because they use a double-precision constant. They should be written like “a += 2.0f*b” to force single precision.
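To make the literal issue concrete, here is a minimal sketch in plain C++ (the same promotion rules apply inside a CUDA kernel). The type aliases and the two accumulate_* helpers are hypothetical names made up for illustration; they are not from the original poster's code. The static_asserts only show which type the compiler picks for the multiplication; they say nothing about how much time the extra double-precision work costs on a given GPU.

```cpp
#include <type_traits>
#include <utility>

// Type of "literal * b" when b is a float.
using DoubleLiteralProduct = decltype(2.0  * std::declval<float>());
using FloatLiteralProduct  = decltype(2.0f * std::declval<float>());

// An unsuffixed literal is a double, so the float operand is promoted.
static_assert(std::is_same<DoubleLiteralProduct, double>::value,
              "2.0 * float is evaluated in double precision");
// With the f suffix the whole expression stays in single precision.
static_assert(std::is_same<FloatLiteralProduct, float>::value,
              "2.0f * float is evaluated in single precision");

// Hypothetical update steps. In a kernel these would be __device__ code.
inline float accumulate_double_literal(float a, float b) {
    a += 2.0 * b;   // computed in double, then narrowed back to float
    return a;
}

inline float accumulate_float_literal(float a, float b) {
    a += 2.0f * b;  // computed entirely in float
    return a;
}
```

Both helpers return the same value for simple inputs, which is why this kind of slip is easy to miss: the result looks right, only the instruction mix changes.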
Double-checked, and it does not use double.
Although not always illuminating (since it isn’t the final GPU machine code), you might want to see what the generated PTX looks like for both kernels.
Also, it’s worth checking how many registers each build uses. Did you try different block and grid configurations to find the best one for your kernel?
Compute capability 1.0 does not support full IEEE floating point and does not have an ABI.
If you compile for 3.5 with --use_fast_math and also specify the option to disable ABI compliance, you should get comparable performance.