I know this might be rather broad without the context of my application, but is there any reason why I am seeing slower run times when I compile my code with -arch sm_20 as opposed to -arch sm_13?
The app is running on a Tesla C2050 with CUDA 3.2 on a Linux platform. If I compile with -arch sm_13, the average runtime is 38.6 +/- 0.01 seconds. If I make absolutely no changes to the code and just change the makefile so it compiles with -arch sm_20, runtimes drop to 34.6 +/- 0.01 seconds. I would expect the runtimes to at least stay the same. The thread block size is the only CUDA-specific parameter, and it was optimized for a Tesla C1060, but even if I change the thread block size up or down for the C2050 and recompile with sm_20, the best runtime I get is still 34.6 seconds.