I have a VS2015 program, built as an executable, containing two kernels that do some double-precision work, but not a lot. The same exe takes 12 s on my old Quadro K2000 and 52 s on the GTX 1060. Even allowing for the GeForce being a graphics card and the Quadro a compute card, that’s a big difference. Can anyone enlighten me on why?
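The GTX 1060 provides roughly 120 DP GFLOPS, the K2000 roughly 30 DP GFLOPS, so on raw double-precision throughput the GTX 1060 should be the faster card, not the slower one. Hypotheses: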
(1) Misattribution of the performance data: run time is actually 12 seconds on the GTX 1060 and 52 seconds on the K2000.
(2) Benchmarking methodology: the timed portion of the benchmark captures much more than just kernel execution. Use the CUDA profiler to gain clarity.
(3) The kernels are not actually functionally equivalent, maybe due to a different configuration parameter, leading to differences in run time.
(4) The slower kernel is the result of a debug build, the faster kernel is the result of a release build.
(5) The slower kernel is affected by JIT overhead, because the fat binary does not contain machine code (SASS) for each of the GPU architectures used, so the driver has to compile the embedded PTX when the program runs.
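To tell hypotheses (2) and (5) apart without a profiler, one option is to time startup and kernel execution separately. The following is a minimal sketch, not the original program: the kernel `scale`, the helpers `grid_size` and `time_launch`, and the problem size are all made up for illustration. Where exactly any JIT cost lands (context creation or first launch) depends on the CUDA version, so the sketch reports both.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the real double-precision kernels.
__global__ void scale(double *x, int n, double a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Number of blocks needed to cover n elements (rounds up).
int grid_size(int n, int block) { return (n + block - 1) / block; }

// Time one launch of scale() with CUDA events; returns milliseconds.
float time_launch(double *d_x, int n)
{
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start));
    scale<<<grid_size(n, 256), 256>>>(d_x, n, 2.0);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms;
}

int main()
{
    using clk = std::chrono::steady_clock;

    // The first runtime call forces context creation; host-side timing
    // captures that setup cost, which events cannot see.
    auto t0 = clk::now();
    CHECK(cudaFree(0));
    auto t1 = clk::now();
    double init_ms =
        std::chrono::duration<double, std::milli>(t1 - t0).count();

    const int n = 1 << 20;
    double *d_x = nullptr;
    CHECK(cudaMalloc(&d_x, n * sizeof(double)));
    CHECK(cudaMemset(d_x, 0, n * sizeof(double)));

    // A large gap between the first and second launch points at one-time
    // overhead rather than at the kernel itself.
    float first_ms = time_launch(d_x, n);
    float second_ms = time_launch(d_x, n);

    printf("context init : %.3f ms\n", init_ms);
    printf("first launch : %.3f ms\n", first_ms);
    printf("second launch: %.3f ms\n", second_ms);

    CHECK(cudaFree(d_x));
    return 0;
}
```

If the startup figure dominates on one GPU but not the other, the difference is probably not in the kernels themselves.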
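Likely (5), as the CUDA 8 template project in VS2015 defaults to compute_20,sm_20, so the exe carries no SASS for the GTX 1060 (sm_61) and the driver has to JIT-compile the PTX.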
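One way to check this from inside the program is to compare the device's compute capability with what the runtime reports for a kernel. The sketch below is an illustration under assumptions: `probe` is an empty hypothetical kernel and `cc` is a made-up helper. It only prints the values; interpreting them is a heuristic, not a definitive JIT detector.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Empty hypothetical kernel, present only so its attributes can be queried.
__global__ void probe() {}

// Encode a compute capability as major*10 + minor, e.g. 6.1 -> 61,
// the same encoding cudaFuncAttributes uses.
int cc(int major, int minor) { return 10 * major + minor; }

int main()
{
    int dev = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&dev) != cudaSuccess ||
        cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
        fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }

    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, probe);
    if (err != cudaSuccess) {
        // Typically means no SASS or PTX in the binary can run on this GPU.
        fprintf(stderr, "cudaFuncGetAttributes: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("device          : %s (sm_%d)\n", prop.name,
           cc(prop.major, prop.minor));
    printf("binaryVersion   : %d\n", attr.binaryVersion);
    printf("ptxVersion      : %d\n", attr.ptxVersion);
    return 0;
}
```

A ptxVersion well below the device's compute capability (for example 20 on an sm_61 device) would be consistent with the kernel having been built from old PTX at run time. Adding the real architectures to the project's Code Generation setting, for example compute_30,sm_30;compute_61,sm_61 (equivalent to nvcc's -gencode arch=compute_30,code=sm_30 -gencode arch=compute_61,code=sm_61), should embed SASS for both cards.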