Performance discrepancy when compiling with sm_10 vs. sm_13

Dear forum users,

I recently stumbled upon a phenomenon that appears strange to me. I have a kernel simulating something, not too difficult. It uses shared memory arrays, and the registers are NOT full to the brim. All computations run in single precision on a Tesla C1060, which has CUDA compute capability 1.3 (sm_13 in the nvcc compiler settings).
However, compared to the sm_10 build, the sm_13 build runs 30% slower. What could cause such a significant slowdown? In general, the whole compiler optimization process seems very opaque to me. Are there any methods beyond trial and error?

Thank you for your opinion
Peter

You probably have some unintended double-precision arithmetic in your code. When compiling for compute capability 1.0, any doubles get demoted to single precision; when compiling for compute capability 1.3, the doubles remain. Apart from the reduced arithmetic throughput of double precision compared to single, doubles and their math library functions also increase register pressure. These two effects are probably causing the performance difference you are seeing.
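One way to confirm this without auditing every line is to emit PTX and search it for 64-bit floating-point instructions. A minimal sketch (assuming a standard CUDA toolchain; the kernel and file names are made up):

```
// kernel.cu -- 'sum' is declared double, so an sm_13 build generates
// real double-precision arithmetic, while an sm_10 build silently
// demotes it to float.
__global__ void accumulate(const float *in, float *out, int n)
{
    double sum = 0.0;              // double-precision accumulator
    for (int i = 0; i < n; ++i)
        sum += in[i];              // add.f64 on sm_13, add.f32 on sm_10
    out[0] = (float)sum;
}

// To check your own kernels:
//   nvcc -arch=sm_13 -ptx kernel.cu
//   grep "\.f64" kernel.ptx
```

Any `.f64` hit in the PTX means the compiler is generating double-precision instructions somewhere.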

How fast is the program when compiled for sm_12? Compute capabilities below 1.3 do not support double precision, so if the sm_12 build is as fast as the sm_10 one, there are double-precision calculations somewhere in the code that get demoted to floats on the lower compute capabilities.

Ok, I will check my code again. Actually, I am quite sure that all data structures and variables are single precision only. But I agree, double precision seems to be the only reasonable cause. Thank you.

The most frequent cause of inadvertent double-precision computation is the use of literal floating-point constants, which in C/C++ default to double precision unless a suffix is used. In other words, 3.14 is a double-precision constant, while 3.14f is a single-precision (“float”) constant. By C/C++ type promotion rules, such unintended double-precision operands then cause other parts of the expression they are contained in to be evaluated in double precision.
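A minimal sketch of how that plays out in a kernel (the kernel and variable names here are just for illustration):

```
// The literals 2.0 and 3.14 are doubles, so r[i] is promoted, the whole
// expression is evaluated in double precision, and the result is
// converted back to float for the store. The suffixed version stays
// entirely in single precision.
__global__ void circumference(const float *r, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = 2.0 * 3.14 * r[i];       // double precision on sm_13
        // c[i] = 2.0f * 3.14f * r[i];  // all single precision
    }
}
```

On sm_10 both lines compile to the same single-precision code, which is exactly why the discrepancy only shows up at sm_13.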

You might also want to check for math functions used without the “f” suffix, e.g. sin() instead of sinf(). Usually this is less of a problem, since CUDA overloads the base names of the math functions, so sin() called with a float argument is treated the same as sinf().
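For illustration, a sketch of the three cases (the kernel name is made up):

```
// sinf() is explicitly single precision; sin() with a float argument
// resolves to the single-precision overload; but a double operand
// anywhere in the argument selects the double-precision sin().
__global__ void wave(float *out, const float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = sinf(x[i]);            // single precision
        // out[i] = sin(x[i]);          // also single precision (overload)
        // out[i] = sin(2.0 * x[i]);    // double: the literal 2.0 promotes
    }
}
```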