What is Warp Allocation Granularity for?

ncu tells me this.

I am confused: according to Registers per Block, the number of blocks limited by registers should be 7, rather than the 6 shown by ncu.

Warp Allocation Granularity may be a reasonable explanation for this. However, other topics (and CUDA_Occupancy_Calculator.xls) suggest that Warp Allocation Granularity only affects SM 1.0 devices.

So, what's the reason?

Ref: What is Warp Allocation Granularity for?:

The register file per SM is not one big blob. Each warp scheduler (aka “SM partition” or “SMSP”, of which there are 4 per SM on all recent Nvidia GPUs) has 512 warp-wide general purpose registers.

If you’re using 136 registers per thread then each warp scheduler can fit 3 warps, for a total of 12 warps per SM (i.e. 6 threadblocks of 64 threads). 104 registers per scheduler remain available to other threadblocks with lower register requirements.
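The allocation described above can be sketched in a few lines of arithmetic. The 512 warp-wide registers per SMSP and 4 SMSPs per SM are the figures from this answer, not values queried from the hardware:

```python
# Per-partition register math, assuming 512 warp-wide general purpose
# registers per warp scheduler (SMSP) and 4 SMSPs per SM.
REGS_PER_SMSP = 512
SMSP_PER_SM = 4
THREADS_PER_WARP = 32

regs_per_thread = 136
threads_per_block = 64

warps_per_block = threads_per_block // THREADS_PER_WARP           # 2
warps_per_smsp = REGS_PER_SMSP // regs_per_thread                 # 3
warps_per_sm = warps_per_smsp * SMSP_PER_SM                       # 12
blocks_per_sm = warps_per_sm // warps_per_block                   # 6
leftover_regs = REGS_PER_SMSP - warps_per_smsp * regs_per_thread  # 104

print(warps_per_sm, blocks_per_sm, leftover_regs)
```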


Thank you for your answer.

According to the 65536 general registers per SM (which I took as 65536 per SMSP), 65536/(136*64) = 7.5, so the Unallocated Blocks should be 2 instead of 1.

The new SM design also confuses me:

  1. The Programming Guide says "each block stays on a single SM". Is that in fact a single SMSP, or can a block's warps run on different SMSPs simultaneously?
  2. Within the resource limits, can two different blocks run on the same SM simultaneously?
  3. "Max Registers per Multiprocessor is 65536", which is far from your 512 per warp scheduler.
  4. Does Warp Allocation Granularity = 4 originate from the 4 SMSPs per SM?

The 512 is 65536 / 32 / 4.

So it is (registers per SM Partition) / (threads per warp)

or (warps per SM partition) * (registers per thread)
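Both readings give the same 512; a quick check (taking the maximum of 16 warps per partition, i.e. 64 warps per SM over 4 partitions, as an assumption about recent architectures):

```python
TOTAL_REGS_PER_SM = 65536
THREADS_PER_WARP = 32
SMSP_PER_SM = 4
MAX_WARPS_PER_SMSP = 16  # assumption: 64 warps per SM / 4 partitions

# (registers per SM partition) / (threads per warp)
warp_wide_regs = TOTAL_REGS_PER_SM // SMSP_PER_SM // THREADS_PER_WARP  # 512

# (warps per SM partition) * (registers per thread at full occupancy)
regs_per_thread_full = TOTAL_REGS_PER_SM // (MAX_WARPS_PER_SMSP * SMSP_PER_SM
                                             * THREADS_PER_WARP)       # 32
assert warp_wide_regs == MAX_WARPS_PER_SMSP * regs_per_thread_full

print(warp_wide_regs)
```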

The warps of a block can be distributed on several SMSPs. That is even preferred for load balancing reasons.

Two different blocks can run on one SM, or one SMSP.

The Warp Allocation Granularity is also 4 by accident. They are at different ends of the calculation. (65536 / 32 / 4 SMSPs = 512; and those 512 again can only be distributed in multiples = granularity of 4).

The concept of warp groups in GH100 requires warps to be allocated to SMSPs in a round-robin fashion. An SM can support CU_DEVICE_ATTRIBUTE_MAX_BLOCKS_PER_MULTIPROCESSOR thread blocks per SM.


I think you are referring to Warp Register Allocation Granularity, which is 256. This means that registers per thread are allocated in multiples of 8 (256 registers/warp / 32 threads/warp). I don't recall this being less than 8 since the Fermi architecture.
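A minimal sketch of that per-thread rounding (the helper name is mine, for illustration):

```python
REG_ALLOC_UNIT = 256    # registers allocated per warp, in units of 256
THREADS_PER_WARP = 32

def allocated_regs_per_thread(requested: int) -> int:
    """Round the per-thread register count up to the allocation granularity."""
    unit = REG_ALLOC_UNIT // THREADS_PER_WARP  # 256 / 32 = 8 registers/thread
    return -(-requested // unit) * unit        # ceiling to a multiple of 8

print(allocated_regs_per_thread(130))  # a request for 130 is charged as 136
print(allocated_regs_per_thread(136))  # already a multiple of 8, stays 136
```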


Thank you for your answer. Besides, is there any information about warp allocation granularity = 4?

Where does it say so? Could you post a reference or link?

Nsight Compute's occupancy calculator displays the relevant data for each GPU architecture.

Register Allocation Unit Size = 256 for all archs

Register Allocation Granularity = warp for all archs

Shared Memory Allocation Unit Size = 256 for arch <= 7.5, else 128

Warp Allocation Granularity = 2 for arch 5.2, 5.3, 6.0, else 4

Warp Register Allocation Granularity = 256
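Putting these constants together reproduces the 6-block limit from the original question. The formula below is my reconstruction of the register-limit step of the occupancy calculation (round registers per warp up to the allocation unit, round the resulting warp count down to the Warp Allocation Granularity); treat it as a sketch, not the exact implementation:

```python
def round_up(x, m):
    return -(-x // m) * m

def round_down(x, m):
    return (x // m) * m

REGS_PER_SM = 65536
THREADS_PER_WARP = 32
REG_ALLOC_UNIT = 256        # Register Allocation Unit Size from the list above
WARP_ALLOC_GRANULARITY = 4  # 4 for most archs (2 for 5.2, 5.3, 6.0)

regs_per_thread = 136
threads_per_block = 64
warps_per_block = threads_per_block // THREADS_PER_WARP   # 2

regs_per_warp = round_up(regs_per_thread * THREADS_PER_WARP,
                         REG_ALLOC_UNIT)                  # 4352
warps_by_regs = round_down(REGS_PER_SM // regs_per_warp,
                           WARP_ALLOC_GRANULARITY)        # 15 rounded down to 12
blocks_by_regs = warps_by_regs // warps_per_block         # 6, as ncu shows

print(warps_by_regs, blocks_by_regs)
```

Without the granularity round-down, 65536 // 4352 = 15 warps would allow 7 blocks, which is exactly the 7-vs-6 discrepancy the question started from.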