Diminishing Efficacy of Monolithic Kernels & Determining the GPU Sweet Spot

I simulated a bunch of particles going through a couple of transformations. Each particle has its own thread and passes through a number of kernels. In the charts below, the horizontal axis is the number of kernels and the vertical axis is the number of particles (threads).
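
For reference, a minimal sketch of the launch style described above: one thread per particle, with each kernel applied to the whole population in sequence. The Particle struct and the transform bodies are placeholders, not the actual simulation code.

#include <cuda_runtime.h>

struct Particle { float x, y, z, vx, vy, vz; };

// One thread per particle; each thread handles exactly one element.
__global__ void transformA(Particle* p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p[i].x += p[i].vx;    // placeholder transformation
}

__global__ void transformB(Particle* p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p[i].vx *= 0.99f;     // placeholder transformation
}

void step(Particle* d_particles, int n)
{
    int block = 256;
    int grid  = (n + block - 1) / block;   // enough blocks to cover every particle
    transformA<<<grid, block>>>(d_particles, n);
    transformB<<<grid, block>>>(d_particles, n);
}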

The simulation was timed, and the number you see there is the average CPU simulation time / average GPU simulation time (the CPU simulation used the 'striding' scheme, one thread per core). So what you see is the efficacy, i.e. how many times faster the GPU was than the CPU.

[Results chart: GeForce GTX 660M]

[Results chart: Tesla 2070]

(Yes, in some instances the Tesla is up to 200× faster than a 4-core (8 hyperthreads) CPU.)

What is noticeable is that a very large thread count greatly diminishes the efficacy. To combat this, it would probably be better to use striding (grid-stride) kernels.
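
For comparison, a grid-stride ('striding') version of one of the transformations: a fixed number of threads is launched and each thread loops over many particles. This is a generic sketch reusing the placeholder Particle struct from above, not the original code.

__global__ void transformA_strided(Particle* p, int n)
{
    int stride = blockDim.x * gridDim.x;               // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        p[i].x += p[i].vx;                             // placeholder transformation
}

// Launched with a grid sized to the device rather than to the particle count:
//     transformA_strided<<<numBlocks, 256>>>(d_particles, n);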

Now the question is, how do you determine the thread-count sweet spot for each GPU?

The GTX has ~380 CUDA cores, while the Tesla has ~440, but if you look at the charts, the Tesla's sweet spot is about 10-20× higher.

So how do you reliably determine the optimum number of striding kernel threads for a card without benchmarking it?

Thanks!

I do not really see how this is different from any other profiling/optimization task.

The kernel code determines the demand for and utilization of device (and perhaps host) resources, at certain points and arrival rates, and that shapes throughput.

Perhaps the key is to understand why throughput is a function of thread count. I generally prefer smaller blocks when I have frequent global memory accesses, particularly at different points.
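
As a concrete, benchmark-free starting point (a suggestion, not something taken from the posts above): the launch configuration for a striding kernel can be derived from device queries plus the CUDA occupancy API (available since CUDA 6.5), instead of a hard-coded thread count. A minimal sketch, reusing the placeholder transformA_strided kernel from the question:

#include <cuda_runtime.h>
#include <cstdio>

void launchStrided(Particle* d_particles, int n)
{
    int device = 0, numSMs = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

    // Block size that maximizes occupancy for this kernel on this device.
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       transformA_strided, 0, 0);

    // How many such blocks fit on one SM, then fill every SM.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM,
                                                  transformA_strided,
                                                  blockSize, 0);
    int gridSize = blocksPerSM * numSMs;

    printf("block=%d grid=%d (%d SMs)\n", blockSize, gridSize, numSMs);
    transformA_strided<<<gridSize, blockSize>>>(d_particles, n);
}

This only gives a per-device starting point; the actual sweet spot still depends on the kernel's memory access pattern, so some profiling around that value is usually unavoidable.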