How do I best choose <<<grid, threads>>> for best performance?

Hi,

I wrote a kernel that does not use shared memory and works on a problem that splits into parallel parts very well.

At runtime I calculate how many parts there are; in one example it is approx. 15000 parts.

In the kernel I compute “tid” and loop over it in grid-sized strides:

// Grid-stride loop: each thread handles tid, tid + stride, tid + 2*stride, ...
for (tid = threadIdx.x + blockIdx.x * blockDim.x;
     tid < max_tid;
     tid += blockDim.x * gridDim.x)
{
    // ... do the work; the loop condition already guarantees tid < max_tid
}

Launching the kernel as <<<max_tid, 1>>> seemed reasonable to me, but gives bad results.
Launching it as <<<max_tid, 50>>> gives much better results, but I can’t explain why.
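Here is a minimal, self-contained version of what I am doing (the kernel body is placeholder work, not my real code):

#include <cuda_runtime.h>

__global__ void work_kernel(float *data, int max_tid)
{
    for (int tid = threadIdx.x + blockIdx.x * blockDim.x;
         tid < max_tid;
         tid += blockDim.x * gridDim.x)
        data[tid] *= 2.0f;  // placeholder work
}

void run(float *d_data, int max_tid)
{
    // Slow: one thread per block.
    work_kernel<<<max_tid, 1>>>(d_data, max_tid);
    // Much faster, for reasons I cannot explain:
    work_kernel<<<max_tid, 50>>>(d_data, max_tid);
}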

Is there a rule or formula to get the best performance?

Thanks for any hints,
Torsten.

It depends on the GPU. You may try the “CUDA Occupancy Calculator” from http://developer.nvidia.com/cuda-toolkit-32-downloads to tune the launch configuration for your device.
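On toolkits newer than the 3.2 release linked above, the runtime can also suggest a block size directly: cudaOccupancyMaxPotentialBlockSize (added in CUDA 6.5) returns the block size that maximizes theoretical occupancy for a given kernel. A sketch, reusing the placeholder kernel from the question:

#include <cuda_runtime.h>

__global__ void work_kernel(float *data, int max_tid)
{
    for (int tid = threadIdx.x + blockIdx.x * blockDim.x;
         tid < max_tid;
         tid += blockDim.x * gridDim.x)
        data[tid] *= 2.0f;  // placeholder work
}

void launch(float *d_data, int max_tid)
{
    int minGridSize = 0, blockSize = 0;
    // The runtime accounts for the kernel's register and shared-memory
    // usage when picking the occupancy-maximizing block size.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       work_kernel, 0, 0);
    int gridSize = (max_tid + blockSize - 1) / blockSize;  // round up
    work_kernel<<<gridSize, blockSize>>>(d_data, max_tid);
}

Occupancy is only a heuristic, though; the benchmarking advice below still applies.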

Why did this seem reasonable at all? The smallest unit of execution on the hardware is a warp of 32 threads. With a block size of 1, each warp carries a single active thread, so 31/32 of the device sits idle.

No. Choose block sizes that are multiples of 32, or you are just wasting hardware. Then benchmark every block size your algorithm can launch with, without getting “too many resources requested for launch”, and choose the fastest; a sketch of such a sweep follows.
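A sketch of the sweep, using cudaEvent timing and the same placeholder kernel as above (in a real benchmark you would warm up first and average several runs):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void work_kernel(float *data, int max_tid)
{
    for (int tid = threadIdx.x + blockIdx.x * blockDim.x;
         tid < max_tid;
         tid += blockDim.x * gridDim.x)
        data[tid] *= 2.0f;  // placeholder work
}

int main()
{
    const int max_tid = 15000;  // approx. part count from the question
    float *d_data;
    cudaMalloc(&d_data, max_tid * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Sweep block sizes in warp-sized steps; 1024 is the per-block
    // limit on Fermi-class GPUs (older devices top out at 512).
    for (int blockSize = 32; blockSize <= 1024; blockSize += 32)
    {
        int gridSize = (max_tid + blockSize - 1) / blockSize;

        cudaEventRecord(start);
        work_kernel<<<gridSize, blockSize>>>(d_data, max_tid);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        // Skip configurations that fail to launch, e.g. with
        // "too many resources requested for launch".
        if (cudaGetLastError() != cudaSuccess)
            continue;

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block size %4d: %.3f ms\n", blockSize, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}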
