Dear forum users,
I recently stumbled upon a phenomenon that appears strange to me. I have a kernel that runs a simulation, nothing too complicated. It uses shared memory arrays, and register usage is NOT at the limit. All computations are run in single precision on a Tesla C1060, which has CUDA compute capability 1.3 (sm_13 in the nvcc compiler settings).
However, compared to the program compiled for sm_10, the program compiled for sm_13 runs 30% slower. What could cause such a significant slowdown? In general, the compiler's optimization process seems very opaque to me. Are there any methods for diagnosing this beyond trial and error?
Thank you for your opinions.