How to decide the optimal block size in CUDA


I am new to CUDA. Can someone tell me how to set the block size? From some of the docs, I gather that the block size is chosen arbitrarily by the programmer. Is it true that I can set any block size up to 768 threads? Here, 768 is the thread limit for each SM.

What is the tradeoff between small and large block sizes?

Thanks a lot!


A block can have at most 512 threads; however, a grid can contain a great many blocks (up to 65535 x 65535).

Devices of compute capability 1.0-1.1 support up to 768 active threads per SM, which means that with 512 threads per block you can have only 1 active block on the SM. With 256-thread blocks you can have 3 active blocks per SM.
Devices of compute capability 1.2-1.3 support up to 1024 active threads per SM, but still only 512 threads per block.
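
The arithmetic above can be sketched in plain host-side C++ (which also compiles as CUDA host code). The helper name `blocksByThreadLimit` is made up for illustration, and the sketch considers only the active-thread limit, ignoring registers, shared memory, and the per-SM block limit.

```cpp
#include <cassert>

// Hypothetical helper: number of blocks that fit on one SM if the only
// constraint is the SM's active-thread limit (768 or 1024, as quoted above).
inline int blocksByThreadLimit(int maxThreadsPerSM, int threadsPerBlock) {
    // 512 is the per-block thread limit mentioned in this thread.
    assert(threadsPerBlock > 0 && threadsPerBlock <= 512);
    return maxThreadsPerSM / threadsPerBlock; // integer division rounds down
}
```

For example, `blocksByThreadLimit(768, 512)` gives 1 and `blocksByThreadLimit(768, 256)` gives 3, matching the numbers above. A single 512-thread block on a 768-thread SM leaves 256 thread slots unused.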

Take a look at the CUDA Occupancy Calculator.

The per-block thread limit is 512 threads, not 768. Each block has affinity to one multiprocessor, and each multiprocessor has limited resources: it can schedule 8 blocks simultaneously, it can schedule 768 or 1024 threads simultaneously, and it has a total of 8192 or 16384 registers and 16 KB of shared memory. Usually you set the block size to try to maximise the use of those resources, and the number you choose depends largely on how resource-intensive the kernel is. NVIDIA provides a spreadsheet showing how kernel resource requirements affect execution, which is worth playing with - you can get it from here
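
As a rough illustration of how those limits interact, here is a simplified estimate in host-side C++. It is only a sketch: the names `SmLimits`, `estimateActiveBlocks`, and `estimateOccupancy` are invented here, and it assumes resources divide exactly, whereas the real spreadsheet also accounts for allocation granularity.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits as quoted above (hypothetical struct).
struct SmLimits {
    int maxBlocks;      // 8
    int maxThreads;     // 768 or 1024
    int registers;      // 8192 or 16384
    int sharedMemBytes; // 16 KB = 16384
};

// Simplified estimate of how many blocks can be active on one multiprocessor.
// Assumption: registers and shared memory are divided exactly, with no
// allocation granularity, so real hardware may report fewer blocks.
inline int estimateActiveBlocks(const SmLimits& sm, int threadsPerBlock,
                                int regsPerThread, int sharedBytesPerBlock) {
    assert(threadsPerBlock > 0 && regsPerThread > 0 && sharedBytesPerBlock >= 0);
    int blocks = sm.maxBlocks;
    blocks = std::min(blocks, sm.maxThreads / threadsPerBlock);
    blocks = std::min(blocks, sm.registers / (regsPerThread * threadsPerBlock));
    if (sharedBytesPerBlock > 0) // a kernel may use no shared memory at all
        blocks = std::min(blocks, sm.sharedMemBytes / sharedBytesPerBlock);
    return blocks;
}

// Fraction of the multiprocessor's thread slots that are occupied.
inline double estimateOccupancy(const SmLimits& sm, int threadsPerBlock,
                                int regsPerThread, int sharedBytesPerBlock) {
    int blocks = estimateActiveBlocks(sm, threadsPerBlock, regsPerThread,
                                      sharedBytesPerBlock);
    return static_cast<double>(blocks * threadsPerBlock) / sm.maxThreads;
}
```

With `SmLimits sm{8, 768, 8192, 16384}`, a 256-thread block using 10 registers per thread and 4096 bytes of shared memory yields 3 active blocks under this model. Raising the register count to 16 per thread drops that to 2, because 3 blocks would need 12288 registers.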

The limit is 512. Usually 128-256 is a good number; the best value depends on the resources the kernel uses (registers and shared memory).

You want as many threads as possible to be scheduled on each multiprocessor (roughly 192 is a good minimum) so that latency, for example when accessing registers, can be hidden.

There is a limit on the number of active blocks per multiprocessor, so you want blocks large enough to reach that minimum thread count. On the other hand, each block needs enough resources: if a block uses a lot of registers or shared memory, there won’t be enough resources to schedule it alongside others.

You also want the block size to be a multiple of 32 (the warp size) or you’ll be wasting threads, and, if possible, you want groups of 16 threads to access consecutive memory for coalescing.
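
To see why a block size that is not a multiple of 32 wastes threads, the following host-side C++ sketch counts the warps a block occupies and the lanes left idle in the last warp. The names `kWarpSize`, `warpsPerBlock`, and `idleLanes` are hypothetical.

```cpp
#include <cassert>

const int kWarpSize = 32; // warp size stated above

// Number of warps a block occupies: partial warps still take a full warp.
inline int warpsPerBlock(int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize; // round up
}

// Lanes in the last warp that have no thread assigned to them.
inline int idleLanes(int threadsPerBlock) {
    return warpsPerBlock(threadsPerBlock) * kWarpSize - threadsPerBlock;
}
```

A block of 100 threads occupies 4 warps and leaves 28 lanes idle, while a block of 192 threads occupies 6 warps with no idle lanes.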

On the other hand, you want enough blocks to spread across all the multiprocessors and fully utilize the card, so you don’t want the blocks to be too large if your problem is small.
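
A small host-side C++ sketch of that trade-off follows. It assumes one thread per element, and the names `blocksNeeded` and `coversAllMultiprocessors` are invented for illustration. The multiprocessor count is a parameter because it varies from card to card.

```cpp
#include <cassert>

// Blocks needed to cover numElements with one thread per element.
inline int blocksNeeded(int numElements, int threadsPerBlock) {
    assert(numElements > 0 && threadsPerBlock > 0);
    return (numElements + threadsPerBlock - 1) / threadsPerBlock; // round up
}

// True if there is at least one block for every multiprocessor.
// numMultiprocessors is whatever the target card has (an assumed input).
inline bool coversAllMultiprocessors(int numElements, int threadsPerBlock,
                                     int numMultiprocessors) {
    return blocksNeeded(numElements, threadsPerBlock) >= numMultiprocessors;
}
```

For 10000 elements, 256-thread blocks give 40 blocks and 512-thread blocks give only 20. On a hypothetical card with 30 multiprocessors, the second choice would leave some of them without any work.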

I find that for 2D problems, blocks of 16x16, 16x12, or at worst 16x8 do a good job.

Optimal depends on what you are doing. Generally it is more efficient to make the number of threads per block a multiple of 32, e.g. 192.
But if there are good reasons not to use a multiple of 32, don’t.

The actual number is passed as part of the kernel call, e.g. in the host code:

for ( blah )
    MyFunct<<< gridDim, blockDim >>>( d_param1, d_param2 );

where earlier in the host code for a 1D grid you have something like
dim3 gridDim( TotalBlocks );
dim3 blockDim( ThreadsPerBlock ); // I usually #define ThreadsPerBlock

or for a 2D grid you might have
dim3 gridDim( 10, 20 );
dim3 blockDim( 16,32 );

(See the CUDA Programming Guide; in version 2.3 it's Appendix B.12.)
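
For the 2D case, the grid dimensions usually follow from the problem size and the chosen block shape. The sketch below is plain C++ and uses a hypothetical `Dim2` struct as a stand-in for `dim3`, so it can be tried without a CUDA toolchain. The helper name `gridFor` is also made up.

```cpp
#include <cassert>

// Stand-in for dim3 (hypothetical), holding only the x and y extents.
struct Dim2 {
    int x;
    int y;
};

// Grid dimensions needed to cover a width x height problem with the given
// block shape, assuming one thread per element.
inline Dim2 gridFor(int width, int height, Dim2 block) {
    // 512 is the per-block thread limit discussed in this thread.
    assert(block.x > 0 && block.y > 0 && block.x * block.y <= 512);
    Dim2 grid;
    grid.x = (width + block.x - 1) / block.x;  // round up in x
    grid.y = (height + block.y - 1) / block.y; // round up in y
    return grid;
}
```

A 640x480 problem with 16x16 blocks needs a 40x30 grid. When the size is not a multiple of the block shape (say 100x100, which needs 7x7 blocks), the kernel has to bounds-check its indices, since the last row and column of blocks extend past the edge.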