I am a newbie using CUDA. Can someone tell me how to set the block size? From some docs, I got the impression that the block size is chosen arbitrarily by the programmer. Is it true that I can set any block size up to 768 threads? Here, 768 is the thread limit for each SM.
What is the tradeoff between small and large block sizes?
A block can only have up to 512 threads; however, in a grid you can have many, many blocks (up to 65535 x 65535).
Compute capability 1.0-1.1 devices support up to 768 active threads on an SM, which means if you had 512 threads in your block you could only have 1 active block on the SM. With a 256-thread block you could have 3 active blocks per SM.
Compute capability 1.2-1.3 devices support up to 1024 active threads, but still only 512 threads per block.
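As a rough sketch of that arithmetic (it ignores the 8-blocks-per-SM cap and the register and shared-memory limits, which can reduce the count further):

int maxActiveThreadsPerSM = 768;    // 1024 on compute 1.2-1.3 devices
int threadsPerBlock       = 256;
int activeBlocksPerSM     = maxActiveThreadsPerSM / threadsPerBlock;   // = 3
// with threadsPerBlock = 512 this would be 768 / 512 = 1 active block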
The per-block thread limit is 512 threads, not 768. Each block has affinity to one multiprocessor, and each multiprocessor has limited resources: it can schedule 8 blocks simultaneously, it can schedule 768 or 1024 threads simultaneously, and it has a total of 8192 or 16384 registers and 16 KB of shared memory. Usually you set the block size to try to maximise the use of those resources, and the number you choose depends largely on how resource-intensive the kernel is. NVIDIA provides a spreadsheet (the CUDA Occupancy Calculator) which shows how a kernel's resource requirements affect execution and which is worth playing with - you can get it from here
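If you want to check those limits for your own card at run time, a minimal sketch using the runtime API's cudaGetDeviceProperties could look like this (it queries device 0 and omits error checking):

#include <stdio.h>
#include <cuda_runtime.h>

int main( void )
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0 );               // properties of device 0

    printf( "Compute capability:      %d.%d\n", prop.major, prop.minor );
    printf( "Max threads per block:   %d\n", prop.maxThreadsPerBlock );
    printf( "Registers per block:     %d\n", prop.regsPerBlock );
    printf( "Shared memory per block: %lu bytes\n", (unsigned long)prop.sharedMemPerBlock );
    printf( "Warp size:               %d\n", prop.warpSize );
    printf( "Multiprocessor count:    %d\n", prop.multiProcessorCount );
    return 0;
}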
The limit is 512. Usually 128-256 is a good number, and it depends on the resources used (registers and shared memory).
You want as many threads as possible to be scheduled on each multiprocessor (more or less 192 is a good minimum) to be able to hide latency when accessing registers.
There is a limit on the number of active blocks per multiprocessor, so you want blocks large enough that this limit doesn't leave the multiprocessor under-occupied. On the other hand, each block has to fit within the multiprocessor's resources: if the block uses a lot of registers or shared memory, there won't be enough resources left to schedule many blocks alongside it.
You also want the block size to be a multiple of 32 (the warp size) or you'll be wasting threads, and, if possible, have groups of 16 threads accessing consecutive memory for coalescing.
On the other hand, you want enough blocks to schedule over all the multiprocessors to fully utilize the card, so you don't want the blocks too large if your problem is small.
I find that for 2D problems, blocks of 16x16 or 16x12, or at worst 16x8, do a good job.
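For instance, a sketch of a 16x16 launch for a width x height 2D problem (the kernel name, data pointer, and dimensions are placeholders; note the rounded-up grid and the bounds check for the partially filled edge blocks):

__global__ void myKernel2D( float *data, int width, int height )
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if ( x < width && y < height )          // guard the partially filled edge blocks
        data[y * width + x] *= 2.0f;
}

// host side:
dim3 blockDim( 16, 16 );                                   // 256 threads, a multiple of 32
dim3 gridDim( (width  + blockDim.x - 1) / blockDim.x,      // round up so the whole
              (height + blockDim.y - 1) / blockDim.y );    // domain is covered
myKernel2D<<< gridDim, blockDim >>>( d_data, width, height );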
The optimal size depends on what you are doing. Generally it is more efficient to make the number of threads per block a multiple of 32, e.g. 192.
But if there are good reasons not to use a multiple of 32, don't.
The actual numbers are passed as part of the kernel call in the host code, e.g.
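(myKernel and args here stand in for your own kernel and its arguments)

myKernel<<< gridDim, blockDim >>>( args );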
where earlier in the host code for a 1D grid you have something like
dim3 gridDim( TotalBlocks );
dim3 blockDim( ThreadsPerBlock ); // I usually #define ThreadsPerBlock
or for a 2D grid you might have
dim3 gridDim( 10, 20 );
dim3 blockDim( 16, 32 );
(see the CUDA Programming Guide; in version 2.3 it's Appendix B.12.)
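Putting the 1D pieces together, a minimal sketch might look like this (fillKernel, N and d_out are placeholders, and the grid size is rounded up so all N elements are covered):

#define ThreadsPerBlock 192     // a multiple of 32, as discussed above

__global__ void fillKernel( float *out, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global 1D thread index
    if ( i < n )                                     // guard the rounded-up tail
        out[i] = (float)i;
}

// ... in the host code, with d_out already allocated via cudaMalloc:
int  TotalBlocks = ( N + ThreadsPerBlock - 1 ) / ThreadsPerBlock;
dim3 gridDim( TotalBlocks );
dim3 blockDim( ThreadsPerBlock );
fillKernel<<< gridDim, blockDim >>>( d_out, N );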