Why is max threads per SM larger than max threads per block?

On our Volta70 GPU, the max number of threads per SM is 2048, while the max number of threads per block is 1024.

It seems that the max number of threads per SM follows from hardware limitations (4 schedulers × 16 warps per scheduler × 32 threads per warp = 2048).

However, it is not immediately clear why the max number of threads per block is smaller. Does anyone know what explains this? Thanks.


CUDA GPU SMs can have more than one block resident. The maximum number of threads per SM is a limit across all resident threadblocks. This number being 2048 means, for example, that two threadblocks of 1024 threads each could be resident at the same time.
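
You can see this with the occupancy API. A minimal sketch (the empty kernel and the block size are just placeholders here; a real kernel may be limited further by register or shared-memory usage):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; real kernels can reduce occupancy via registers/shared memory.
__global__ void dummyKernel() {}

int main() {
    int blockSize = 1024;
    int maxBlocksPerSM = 0;
    // Ask the runtime how many blocks of this size can be resident on one SM.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, dummyKernel, blockSize, 0);
    printf("Blocks of %d threads resident per SM: %d\n", blockSize, maxBlocksPerSM);
    printf("Threads resident per SM: %d\n", maxBlocksPerSM * blockSize);
    return 0;
}
```

On a Volta device with a trivial kernel like this, that would report 2 blocks, i.e. 2048 resident threads per SM.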

The CUDA threadblock limit of 1024 threads has been a standard part of CUDA for at least the last 10 years; every device from compute capability 2.0 up to current devices has this same limit. I wouldn't be able to offer any further explanation than that.

(cc1.x devices had a limit of 512 threads per block.)
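
Both limits are exposed through cudaGetDeviceProperties, so you can confirm them on your own device. A minimal sketch:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0
    printf("maxThreadsPerBlock:          %d\n", prop.maxThreadsPerBlock);           // 1024 on cc 2.0 and later
    printf("maxThreadsPerMultiProcessor: %d\n", prop.maxThreadsPerMultiProcessor);  // 2048 on Volta
    return 0;
}
```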

Just guessing here:

It could be related to block-wide synchronization for certain instructions (needing more physical data lines, shadow registers, bits, …). It could also be an artificial limitation to keep kernels compatible across architectures.

The maximum number of threads per block is an architectural decision based on area cost and compatibility. Increasing the dimensions increases wiring and area cost. This is not something that software developers generally consider, but it is critical in efficient hardware design. Increasing the maximum number of threads per block on some architectures (e.g. the 100 class) would mean either that lower-end parts also need to increase the value or that there would be compatibility issues. The graphics/mobile focused GPUs tend to have fewer threads per SM to reduce cost while still providing adequate resources for typical use cases. One of the benefits of CUDA is the ability to write a compatible program that can scale from 1-2 SM GPUs all the way to 100+ SM GPUs.
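
As an illustration of that last point (a sketch, not from the original discussion), the common grid-stride loop pattern keeps the block size within the 1024-thread limit and derives the grid size from the device, so the same kernel scales from small to large GPUs:

```cpp
#include <cuda_runtime.h>

// Grid-stride loop: each thread handles multiple elements, so the same kernel
// works correctly for any grid size the hardware supports.
__global__ void scaleArray(float *data, int n, float factor) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {
        data[i] *= factor;
    }
}

// Launch helper (hypothetical): block size well under the 1024-thread limit,
// grid size proportional to the number of SMs on the device.
void launchScale(float *d_data, int n, float factor) {
    int device = 0, numSMs = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    int blockSize = 256;
    int gridSize  = 32 * numSMs;
    scaleArray<<<gridSize, blockSize>>>(d_data, n, factor);
}
```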