I’m running exactly the same program with the same execution configuration on both a Tesla C1060 and a GTS 250. The execution configuration is 32 thread blocks of 96 threads each, with 2212 Kb of shared memory per block and 14 registers per thread. Strangely, the execution time on the GTS 250 is approximately 2 seconds shorter than on the Tesla. Could some good soul give me a hint about this?
Is this somehow related to latency hiding?
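
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two cards could be measured under the same launch configuration. It is not the actual program: `dummyKernel`, the `iters` count, and the use of device 0 are placeholders assumed only for illustration. Only the 32 x 96 launch configuration comes from the description above. The sketch prints the multiprocessor count and clock of the device and times one launch with CUDA events.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; the real kernel is not shown in this post.
__global__ void dummyKernel(float *out, int iters)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (float)tid;
    for (int i = 0; i < iters; ++i)
        v = v * 1.0001f + 0.5f;   // arbitrary arithmetic to keep the SMs busy
    out[tid] = v;                 // store the result so the loop is not optimized away
}

int main()
{
    const int blocks  = 32;       // launch configuration from the post
    const int threads = 96;
    const int iters   = 100000;   // assumed workload size, purely illustrative
    const int device  = 0;        // assumed device index

    // Query the two numbers that differ most between the cards.
    cudaDeviceProp prop;
    int smCount = 0, clockKHz = 0;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess ||
        cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device) != cudaSuccess ||
        cudaDeviceGetAttribute(&clockKHz, cudaDevAttrClockRate, device) != cudaSuccess) {
        fprintf(stderr, "could not query device %d\n", device);
        return 1;
    }
    printf("%s: %d multiprocessors, %.0f MHz\n", prop.name, smCount, clockKHz / 1000.0);

    float *d_out = NULL;
    if (cudaMalloc(&d_out, blocks * threads * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time a single launch with events recorded on the default stream.
    cudaEventRecord(start, 0);
    dummyKernel<<<blocks, threads>>>(d_out, iters);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
    printf("%d blocks x %d threads: %.3f ms\n", blocks, threads, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

Running the same binary on each card and comparing the two printed lines may show whether 32 blocks are enough to occupy every multiprocessor. If they are not, the card with the higher clock could plausibly finish first even though it has fewer multiprocessors.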