What is Warp Allocation Granularity for?

ncu tells me this.

I am confused: according to Registers per Block, the number of blocks limited by registers should be 7, rather than the 6 shown by ncu.

Warp Allocation Granularity may be a reasonable explanation for this. However, other topics (and CUDA_Occupancy_Calculator.xls) suggest that Warp Allocation Granularity only affects SM 1.0 devices.

So, what's the reason?

Ref: What is Warp Allocation Granularity for?:

The register file per SM is not one big blob. Each warp scheduler (aka “SM partition” or “SMSP”, of which there are 4 per SM on all recent Nvidia GPUs) has 512 warp-wide general purpose registers.

If you’re using 136 registers per thread then each warp scheduler can fit 3 warps, for a total of 12 warps per SM (i.e. 6 threadblocks of 64 threads). 104 registers per scheduler remain available to other threadblocks with lower register requirements.
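The allocation described above can be sketched in a few lines of arithmetic. The 512 warp-wide registers per SMSP and 4 SMSPs per SM are the figures from this answer, not values queried from the hardware:

```python
# Per-partition register math, assuming 512 warp-wide general purpose
# registers per warp scheduler (SMSP) and 4 SMSPs per SM.
REGS_PER_SMSP = 512
SMSP_PER_SM = 4
THREADS_PER_WARP = 32

regs_per_thread = 136
threads_per_block = 64

warps_per_block = threads_per_block // THREADS_PER_WARP           # 2
warps_per_smsp = REGS_PER_SMSP // regs_per_thread                 # 3
warps_per_sm = warps_per_smsp * SMSP_PER_SM                       # 12
blocks_per_sm = warps_per_sm // warps_per_block                   # 6
leftover_regs = REGS_PER_SMSP - warps_per_smsp * regs_per_thread  # 104

print(warps_per_sm, blocks_per_sm, leftover_regs)
```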


Thank you for your answer.

According to the 65536 general registers per SM (which I took as 65536 per SMSP), 65536/(136*64) = 7.5, so the Unallocated Blocks should be 2 instead of 1.

The new SM design also confuses me:

  1. The Programming Guide says "each block stays on a single SM". Is that in fact a single SMSP, or can a block's warps run on different SMSPs simultaneously?
  2. Within the resource limits, can two different blocks run on the same SM simultaneously?
  3. "Max Registers per Multiprocessor is 65536", which is far from your 512 per warp scheduler.
  4. Does Warp Allocation Granularity = 4 originate from the 4 SMSPs per SM?

The 512 is 65536 / 32 / 4.

So it is (registers per SM Partition) / (threads per warp)

or (warps per SM partition) * (registers per thread)
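Both readings give the same 512; a quick check (taking the maximum of 16 warps per partition, i.e. 64 warps per SM over 4 partitions, as an assumption about recent architectures):

```python
TOTAL_REGS_PER_SM = 65536
THREADS_PER_WARP = 32
SMSP_PER_SM = 4
MAX_WARPS_PER_SMSP = 16  # assumption: 64 warps per SM / 4 partitions

# (registers per SM partition) / (threads per warp)
warp_wide_regs = TOTAL_REGS_PER_SM // SMSP_PER_SM // THREADS_PER_WARP  # 512

# (warps per SM partition) * (registers per thread at full occupancy)
regs_per_thread_full = TOTAL_REGS_PER_SM // (MAX_WARPS_PER_SMSP * SMSP_PER_SM
                                             * THREADS_PER_WARP)       # 32
assert warp_wide_regs == MAX_WARPS_PER_SMSP * regs_per_thread_full

print(warp_wide_regs)
```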

The warps of a block can be distributed on several SMSPs. That is even preferred for load balancing reasons.

Two different blocks can run on one SM, or one SMSP.

The Warp Allocation Granularity is also 4 by accident. They are at different ends of the calculation. (65536 / 32 / 4 SMSPs = 512; and those 512 again can only be distributed in multiples = granularity of 4).

The concept of warp groups in GH100 requires warps to be allocated to SMSPs in a round-robin fashion. An SM can support CU_DEVICE_ATTRIBUTE_MAX_BLOCKS_PER_MULTIPROCESSOR thread blocks per SM.


I think you are referring to Warp Register Allocation Granularity, which is 256. This means that registers per thread are allocated in multiples of 8 (256 registers/warp / 32 threads/warp). I don't recall this being less than 8 since the Fermi architecture.
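A minimal sketch of that per-thread rounding (the helper name is mine, for illustration):

```python
REG_ALLOC_UNIT = 256    # registers allocated per warp, in units of 256
THREADS_PER_WARP = 32

def allocated_regs_per_thread(requested: int) -> int:
    """Round the per-thread register count up to the allocation granularity."""
    unit = REG_ALLOC_UNIT // THREADS_PER_WARP  # 256 / 32 = 8 registers/thread
    return -(-requested // unit) * unit        # ceiling to a multiple of 8

print(allocated_regs_per_thread(130))  # a request for 130 is charged as 136
print(allocated_regs_per_thread(136))  # already a multiple of 8, stays 136
```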


Thank you for your answer. Besides, is there any information about warp allocation granularity = 4?

Where does it say so? Could you post a reference or link?

Nsight Compute's occupancy calculator displays the relevant data for each GPU architecture.

Register Allocation Unit Size = 256 for all archs

Register Allocation Granularity = warp for all archs

Shared Memory Allocation Unit Size = 256 for arch <= 7.5, else 128

Warp Allocation Granularity = 2 for arch 5.2, 5.3, 6.0, else 4

Warp Register Allocation Granularity = 256
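Putting these constants together reproduces the 6-block limit from the original question. The formula below is my reconstruction of the register-limit step of the occupancy calculation (round registers per warp up to the allocation unit, round the resulting warp count down to the Warp Allocation Granularity); treat it as a sketch, not the exact implementation:

```python
def round_up(x, m):
    return -(-x // m) * m

def round_down(x, m):
    return (x // m) * m

REGS_PER_SM = 65536
THREADS_PER_WARP = 32
REG_ALLOC_UNIT = 256        # Register Allocation Unit Size from the list above
WARP_ALLOC_GRANULARITY = 4  # 4 for most archs (2 for 5.2, 5.3, 6.0)

regs_per_thread = 136
threads_per_block = 64
warps_per_block = threads_per_block // THREADS_PER_WARP   # 2

regs_per_warp = round_up(regs_per_thread * THREADS_PER_WARP,
                         REG_ALLOC_UNIT)                  # 4352
warps_by_regs = round_down(REGS_PER_SM // regs_per_warp,
                           WARP_ALLOC_GRANULARITY)        # 15 rounded down to 12
blocks_by_regs = warps_by_regs // warps_per_block         # 6, as ncu shows

print(warps_by_regs, blocks_by_regs)
```

Without the granularity round-down, 65536 // 4352 = 15 warps would allow 7 blocks, which is exactly the 7-vs-6 discrepancy the question started from.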