CUDA GPU SMs can have more than one threadblock resident at a time. The maximum number of threads per SM is a limit across all resident threadblocks. If that limit is 2048, then, for example, two threadblocks of 1024 threads each could be resident simultaneously.
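A quick way to confirm both limits on a given device is to query them at runtime. Below is a minimal sketch using the CUDA runtime API (device 0 is assumed):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    // Per-block limit (1024 on cc2.0 and newer) vs. per-SM limit,
    // which is spread across all blocks resident on that SM.
    printf("maxThreadsPerBlock:          %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsPerMultiProcessor: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("multiProcessorCount:         %d\n", prop.multiProcessorCount);
    return 0;
}
```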
The CUDA threadblock limit of 1024 threads has been a standard part of CUDA for at least the last 10 years; every device from compute capability 2.0 through the current generation has this same limit. I wouldn't be able to offer any further explanation than that.
(cc1.x devices had a limit of 512 threads per block.)
It could be related to block-wide synchronization for certain instructions (needing more physical data lines, shadow registers, bits, …). It could also be an artificial limit that keeps kernels compatible across architectures.
The maximum number of threads per block is an architectural decision based on area cost and compatibility. Increasing the limit increases wiring and therefore area cost; this is not something software developers generally consider, but it is critical in efficient hardware design. Raising the maximum on some architectures (e.g. the 100 class) would mean that either lower-end parts also have to increase the value, or there would be compatibility issues. Graphics/mobile-focused GPUs tend to have fewer threads per SM to reduce cost while still providing adequate resources for the typical use case. One of the benefits of CUDA is the ability to write a compatible program that can scale from 1-2 SM GPUs all the way to 100+ SM GPUs.
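That kind of scaling is commonly written as a grid-stride loop: the kernel is correct for any grid size, and the launch is simply sized to whatever device it runs on. A minimal sketch (the kernel name, the blocks-per-SM heuristic, and the sizes here are illustrative, not taken from the posts above):

```cpp
#include <cuda_runtime.h>

// Grid-stride loop: each thread handles every (gridDim.x * blockDim.x)-th
// element, so correctness doesn't depend on how many blocks or SMs exist.
__global__ void scale(float *data, float factor, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {
        data[i] *= factor;
    }
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Size the grid to the device: a few blocks per SM is a common
    // heuristic, not a rule. A 2-SM GPU and a 100+-SM GPU both work.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int blocks = prop.multiProcessorCount * 4;

    scale<<<blocks, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```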