Diminishing Efficacy of Monolithic Kernels & Determining the GPU Sweet Spot

I simulated a bunch of particles going through a couple of transformations. Each particle has its own thread and passes through a number of kernels. In the charts below, the horizontal axis is the number of kernels and the vertical axis is the number of particles (threads).
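
For reference, a minimal sketch of the launch style described above: one thread per particle, with each kernel applied to the whole population in sequence. The Particle struct and the transform bodies are placeholders, not the actual simulation code.

#include <cuda_runtime.h>

struct Particle { float x, y, z, vx, vy, vz; };

// One thread per particle; each thread handles exactly one element.
__global__ void transformA(Particle* p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p[i].x += p[i].vx;    // placeholder transformation
}

__global__ void transformB(Particle* p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p[i].vx *= 0.99f;     // placeholder transformation
}

void step(Particle* d_particles, int n)
{
    int block = 256;
    int grid  = (n + block - 1) / block;   // enough blocks to cover every particle
    transformA<<<grid, block>>>(d_particles, n);
    transformB<<<grid, block>>>(d_particles, n);
}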

The simulation was timed, and the number you see there is the average CPU simulation time / average GPU simulation time (the CPU simulation used the 'striding' scheme, one thread per core). So what you see is the efficacy, i.e. how many times faster the GPU was than the CPU.

[Results chart: GeForce GTX 660M]

[Results chart: Tesla 2070]

(Yes, in some instances the Tesla is up to 200× faster than a 4-core (8 hyperthreads) CPU.)

What is noticeable is that a very large thread count greatly diminishes the efficacy. To combat this, it would probably be better to use striding (grid-stride) kernels.
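
For comparison, a grid-stride ('striding') version of one of the transformations: a fixed number of threads is launched and each thread loops over many particles. This is a generic sketch reusing the placeholder Particle struct from above, not the original code.

__global__ void transformA_strided(Particle* p, int n)
{
    int stride = blockDim.x * gridDim.x;               // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        p[i].x += p[i].vx;                             // placeholder transformation
}

// Launched with a grid sized to the device rather than to the particle count:
//     transformA_strided<<<numBlocks, 256>>>(d_particles, n);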

Now the question is, how do you determine the thread-count sweet spot for each GPU?

The GTX has ~380 CUDA cores, while the Tesla has ~440, but if you look at the charts, the Tesla's sweet spot is about 10-20× higher.

So how do you reliably determine the optimum number of striding kernel threads for a card without benchmarking it?

Thanks!

I do not really see how this is different from any other profiling/optimization task.

The kernel code determines the demand for and utilization of device (and perhaps host) resources, at certain points and arrival rates, and that shapes throughput.

Perhaps the key is to understand why throughput is a function of thread count. I generally prefer smaller blocks when I have frequent global memory accesses, particularly at different points.
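
As a concrete, benchmark-free starting point (a suggestion, not something taken from the posts above): the launch configuration for a striding kernel can be derived from device queries plus the CUDA occupancy API (available since CUDA 6.5), instead of a hard-coded thread count. A minimal sketch, reusing the placeholder transformA_strided kernel from the question:

#include <cuda_runtime.h>
#include <cstdio>

void launchStrided(Particle* d_particles, int n)
{
    int device = 0, numSMs = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

    // Block size that maximizes occupancy for this kernel on this device.
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       transformA_strided, 0, 0);

    // How many such blocks fit on one SM, then fill every SM.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM,
                                                  transformA_strided,
                                                  blockSize, 0);
    int gridSize = blocksPerSM * numSMs;

    printf("block=%d grid=%d (%d SMs)\n", blockSize, gridSize, numSMs);
    transformA_strided<<<gridSize, blockSize>>>(d_particles, n);
}

This only gives a per-device starting point; the actual sweet spot still depends on the kernel's memory access pattern, so some profiling around that value is usually unavoidable.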