I simulated a bunch of particles going through a couple of transformations. Each particle has its own thread and goes through a number of kernels. Horizontally you have the number of kernels, and vertically you have the number of particles (threads).
The simulation was timed, and the number you see there is the average CPU simulation time / average GPU simulation time. (The CPU simulation used block type ‘striding’, one thread per core.) So what you see is the speedup, i.e. how many times faster the GPU was than the CPU.
GeForce GTX 660M
(Yes, in some instances the Tesla is up to 200x faster than a 4-core (8 hyperthreads) CPU.)
Notably, a large number of threads greatly diminishes the speedup. To combat this, it would probably be better to use striding kernels, where each thread processes several particles instead of just one.
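To make it concrete what I mean by a striding kernel, here is a minimal sketch of the usual grid-stride loop. The kernel body, the names `transform_strided`, `blocks_for` and `launch_transform`, and the block size of 256 are all placeholders I made up for illustration; they are not the actual simulation code.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for one particle transformation.
__global__ void transform_strided(float* x, int n, float scale)
{
    // Each thread starts at its global index and then jumps ahead by the
    // total number of launched threads until all n particles are covered.
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        x[i] = x[i] * scale + 1.0f;
    }
}

// Blocks needed to cover n items with one item per thread, capped at
// max_blocks so the total thread count stays fixed as n grows.
int blocks_for(int n, int threads_per_block, int max_blocks)
{
    int needed = (n + threads_per_block - 1) / threads_per_block;
    return needed < max_blocks ? needed : max_blocks;
}

// Launches the kernel on a device buffer d_x holding n floats.
// threads_per_block = 256 is an assumed value, not a tuned one.
cudaError_t launch_transform(float* d_x, int n, int max_blocks)
{
    const int threads_per_block = 256;
    int blocks = blocks_for(n, threads_per_block, max_blocks);
    transform_strided<<<blocks, threads_per_block>>>(d_x, n, 2.0f);
    cudaError_t err = cudaGetLastError();   // catches launch errors
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();         // catches execution errors
}
```

With this layout the thread count is decoupled from the particle count: something like `launch_transform(d_x, n, 64)` always uses at most 64 * 256 threads, and each thread simply loops over more particles as `n` grows.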
Now the question is, how do you determine the thread count sweet spot for each GPU?
The GTX has ~380 cores, while the Tesla has ~440, but if you look at the chart, the Tesla’s sweet spot is about 10-20x higher.
So how do you reliably determine the optimum number of striding kernel threads for a card without benchmarking it?
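One possible starting point, as far as I can tell, is to ask the CUDA runtime instead of going by core count: it reports the number of multiprocessors and the maximum resident threads per multiprocessor, and the occupancy helpers (available since CUDA 6.5) suggest a block size for a given kernel. The sketch below assumes device 0 and uses a made-up kernel `transform_strided`; `resident_thread_capacity` and `print_launch_suggestion` are hypothetical helper names. The result is only an occupancy-based guess, so it may still be off from the measured sweet spot, especially for memory-bound kernels.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical striding kernel; occupancy depends on the real kernel's
// register and shared-memory usage, so substitute the actual one.
__global__ void transform_strided(float* x, int n, float scale)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        x[i] = x[i] * scale + 1.0f;
    }
}

// Upper bound on the number of threads the hardware can keep resident.
int resident_thread_capacity(int sm_count, int max_threads_per_sm)
{
    return sm_count * max_threads_per_sm;
}

// Prints hardware limits and an occupancy-based launch suggestion
// for device 0 (assumed to be the card of interest).
cudaError_t print_launch_suggestion()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) return err;

    // Block size that maximizes theoretical occupancy for this kernel,
    // plus the minimum grid size needed to reach that occupancy.
    int min_grid = 0, block = 0;
    err = cudaOccupancyMaxPotentialBlockSize(&min_grid, &block,
                                             transform_strided, 0, 0);
    if (err != cudaSuccess) return err;

    // How many blocks of that size fit on one multiprocessor at once.
    int blocks_per_sm = 0;
    err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, transform_strided, block, 0);
    if (err != cudaSuccess) return err;

    int grid = blocks_per_sm * prop.multiProcessorCount;
    printf("%s: %d SMs, hardware limit %d resident threads\n",
           prop.name, prop.multiProcessorCount,
           resident_thread_capacity(prop.multiProcessorCount,
                                    prop.maxThreadsPerMultiProcessor));
    printf("suggested launch: %d blocks x %d threads = %d threads\n",
           grid, block, grid * block);
    return cudaSuccess;
}
```

Launching the striding kernel with the suggested blocks and threads fills the card exactly once and lets the loop cover the remaining particles. This would also explain why core count alone is a poor predictor: the number of resident threads per multiprocessor can differ a lot between architectures.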