How do I best choose <<<grid, threads>>> for best performance?

Hi,

I wrote a kernel that does not use shared memory and works on a problem that splits into parallel parts very well.

At runtime I calculate how many parts there are; in one example it is approx. 15000 parts.

In the kernel I compute “tid” and loop over it in grid-sized strides:

// Grid-stride loop: each thread handles tid, tid + stride, tid + 2*stride, ...
for (tid = threadIdx.x + blockIdx.x * blockDim.x;
     tid < max_tid;
     tid += blockDim.x * gridDim.x)
{
    // ... do the work; the loop condition already guarantees tid < max_tid
}

Launching the kernel as <<<max_tid, 1>>> seemed reasonable to me, but gives bad results.
Launching it as <<<max_tid, 50>>> gives much better results, but I can’t explain why.
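Here is a minimal, self-contained version of what I am doing (the kernel body is placeholder work, not my real code):

#include <cuda_runtime.h>

__global__ void work_kernel(float *data, int max_tid)
{
    for (int tid = threadIdx.x + blockIdx.x * blockDim.x;
         tid < max_tid;
         tid += blockDim.x * gridDim.x)
        data[tid] *= 2.0f;  // placeholder work
}

void run(float *d_data, int max_tid)
{
    // Slow: one thread per block.
    work_kernel<<<max_tid, 1>>>(d_data, max_tid);
    // Much faster, for reasons I cannot explain:
    work_kernel<<<max_tid, 50>>>(d_data, max_tid);
}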

Is there a rule or formula to get the best performance?

Thanks for any hints,
Torsten.

It depends on the GPU. You may try the “CUDA Occupancy Calculator” from http://developer.nvidia.com/cuda-toolkit-32-downloads to tune the launch configuration for your device.
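On toolkits newer than the 3.2 release linked above, the runtime can also suggest a block size directly: cudaOccupancyMaxPotentialBlockSize (added in CUDA 6.5) returns the block size that maximizes theoretical occupancy for a given kernel. A sketch, reusing the placeholder kernel from the question:

#include <cuda_runtime.h>

__global__ void work_kernel(float *data, int max_tid)
{
    for (int tid = threadIdx.x + blockIdx.x * blockDim.x;
         tid < max_tid;
         tid += blockDim.x * gridDim.x)
        data[tid] *= 2.0f;  // placeholder work
}

void launch(float *d_data, int max_tid)
{
    int minGridSize = 0, blockSize = 0;
    // The runtime accounts for the kernel's register and shared-memory
    // usage when picking the occupancy-maximizing block size.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       work_kernel, 0, 0);
    int gridSize = (max_tid + blockSize - 1) / blockSize;  // round up
    work_kernel<<<gridSize, blockSize>>>(d_data, max_tid);
}

Occupancy is only a heuristic, though; the benchmarking advice below still applies.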

Why did this seem reasonable at all? The smallest unit of execution on the hardware is a warp of 32 threads. With a block size of 1, each warp carries a single active thread, so 31/32 of the device sits idle.

No. Choose block sizes that are multiples of 32, or you are just wasting hardware. Then benchmark every block size your algorithm can launch with, without getting “too many resources requested for launch”, and choose the fastest; a sketch of such a sweep follows.
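A sketch of the sweep, using cudaEvent timing and the same placeholder kernel as above (in a real benchmark you would warm up first and average several runs):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void work_kernel(float *data, int max_tid)
{
    for (int tid = threadIdx.x + blockIdx.x * blockDim.x;
         tid < max_tid;
         tid += blockDim.x * gridDim.x)
        data[tid] *= 2.0f;  // placeholder work
}

int main()
{
    const int max_tid = 15000;  // approx. part count from the question
    float *d_data;
    cudaMalloc(&d_data, max_tid * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Sweep block sizes in warp-sized steps; 1024 is the per-block
    // limit on Fermi-class GPUs (older devices top out at 512).
    for (int blockSize = 32; blockSize <= 1024; blockSize += 32)
    {
        int gridSize = (max_tid + blockSize - 1) / blockSize;

        cudaEventRecord(start);
        work_kernel<<<gridSize, blockSize>>>(d_data, max_tid);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        // Skip configurations that fail to launch, e.g. with
        // "too many resources requested for launch".
        if (cudaGetLastError() != cudaSuccess)
            continue;

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block size %4d: %.3f ms\n", blockSize, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}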
