What is Warp Allocation Granularity for?


In the CUDA Occupancy Calculator, there’s a quantity “Warp Allocation Granularity” which is 4 for CUDA compute capability 3.5. I tried to find out what it means, and found some answers indicating that the device only allocates resources in multiples of this quantity. However, both the CUDA Occupancy Calculator and my own experiments (on a K40c card) show that registers are allocated on a per-warp basis (and not in multiples of 4 warps).

For instance, for a kernel with a block size of 64 and 74 registers per thread, one would expect each SM to host at most 6 blocks at a time if the assumption about this granularity were true. But the CUDA Occupancy Calculator shows that each SM can run 12 blocks simultaneously.
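The arithmetic behind this can be sketched quickly. The per-SM limits below are taken from the Occupancy Calculator’s “GPU Data” sheet for CC 3.5; treat them as assumptions of this sketch, not an official formula:

```python
from math import ceil

def round_up(x, m):
    return ceil(x / m) * m

# CC 3.5 (K40c) per-SM limits from the Occupancy Calculator's "GPU Data"
# sheet -- assumptions of this sketch.
REGS_PER_SM = 65536
REG_ALLOC_UNIT = 256        # registers are allocated per warp in units of 256
MAX_BLOCKS_PER_SM = 16

regs_per_thread = 74
warps_per_block = 64 // 32  # block size 64 -> 2 warps

# Per-warp allocation (what the calculator actually shows):
regs_per_warp = round_up(regs_per_thread * 32, REG_ALLOC_UNIT)     # 2560
blocks_by_regs = REGS_PER_SM // (regs_per_warp * warps_per_block)  # 12
blocks_per_sm = min(blocks_by_regs, MAX_BLOCKS_PER_SM)             # 12

# If warps per block were instead rounded up to a multiple of 4:
blocks_if_rounded = REGS_PER_SM // (regs_per_warp * round_up(warps_per_block, 4))  # 6

print(blocks_per_sm, blocks_if_rounded)  # 12 6
```

So the per-warp model reproduces the calculator’s 12 blocks, while the “multiples of 4 warps” reading would predict only 6.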

I checked this on the device as well, by printing the smid for each block. Since the scheduler assigns blocks to SMs in a round-robin fashion, as long as there are no more blocks than the device can run simultaneously, all the SMs get the same number of blocks. However, when there are more blocks than the device can run at the same time, not all SMs get the same number of blocks assigned to them, because waiting blocks are assigned to an SM as soon as it has enough free resources to host one. My observations using this method were consistent with what the CUDA Occupancy Calculator shows. (This doesn’t technically prove anything, though.)
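The counting argument can be illustrated with a toy model. This is not the real scheduler; the 15-SM count for the K40c and the 12-blocks-per-SM figure are carried over from the discussion above as assumptions:

```python
from collections import Counter

# Toy model of the smid-counting argument -- NOT the real scheduler.
def round_robin_counts(num_blocks, num_sms):
    """Blocks per SM if blocks are handed out round-robin."""
    return Counter(b % num_sms for b in range(num_blocks))

# With 15 * 12 = 180 blocks over 15 SMs, uniform round-robin assignment
# gives exactly 12 blocks per smid -- which is what you would observe if
# each SM can really host 12 resident blocks of this kernel at once.
counts = round_robin_counts(15 * 12, 15)
print(sorted(counts.values()))  # [12, 12, ..., 12]
```

If each SM could only host 6 blocks, some of the 180 blocks would have to wait, and the per-smid counts observed on the device would no longer be uniform in general.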

Then what is “Warp Allocation Granularity” for in practice?

Maybe it’s the granularity of register allocation per warp? I.e., if you need 74 regs, 76 are allocated, since 76 % 4 == 0.

While this isn’t documented in the Programming Guide anymore, there is no need for guessing. The Occupancy Calculator spreadsheet has all formulas in it to calculate occupancy for architectures up to Maxwell.

Warp allocation granularity is used in one place: The number of warps in the calculation of total register consumption is rounded up to be a multiple of the Warp Allocation Granularity.
Looking further at the “GPU Data” sheet of the spreadsheet, one can notice that the Warp Allocation Granularity coincides with the number of warp schedulers. As most likely each warp scheduler has its own register banks, it appears that Nvidia simply uses the same register-bank configuration for each warp scheduler.
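A sketch of how the two paths in the “Calculator” sheet read. The parameter names are my own, not the spreadsheet’s exact cell formulas:

```python
from math import ceil

def round_up(x, m):
    return ceil(x / m) * m

def regs_per_block(regs_per_thread, warps_per_block, granularity,
                   warp_alloc_gran, reg_alloc_unit):
    """Registers one block consumes, following the spreadsheet's two paths."""
    if granularity == "warp":
        # Each warp's registers are rounded up to the allocation unit;
        # Warp Allocation Granularity is not used on this path.
        return round_up(regs_per_thread * 32, reg_alloc_unit) * warps_per_block
    else:  # "block": warps per block are first rounded up to the
           # Warp Allocation Granularity, then registers to the unit.
        warps = round_up(warps_per_block, warp_alloc_gran)
        return round_up(regs_per_thread * 32 * warps, reg_alloc_unit)

# The K40c example from above (74 regs/thread, 2 warps/block):
print(regs_per_block(74, 2, "warp", 4, 256))   # 5120
print(regs_per_block(74, 2, "block", 4, 256))  # 9472
```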

This is exactly what’s confusing me. I found the reference in the “Calculator” sheet, but it’s only used if the register allocation granularity is block. However, for all the SM types in the “GPU Data” sheet this granularity is “warp” and none has “block” granularity. Is this something that can be changed somewhere?

It’s not related to my initial question, but I think the number of warp schedulers isn’t included in the “GPU Data” sheet.

Compute capability 1.x devices had Block Allocation Granularity, with a granularity of 2 warps.

Indeed. It needs to be taken from other sources, e.g. Appendix G of the Programming Guide.

So if the allocation granularity (block vs. warp) is fixed for every device, then why is Warp Allocation Granularity defined for devices that allocate resources at warp granularity (and why does it even have different values: 2 for sm_20, sm_21, and sm_60, and 4 for the rest)?

I’d suspect it’s for historic reasons. But I guess only someone inside Nvidia can answer this.

Greg Smith says that “If the kernel configuration is N then the hardware allocates resources for N rounded up to a multiple of WarpAllocationGranularity” in http://stackoverflow.com/questions/24940448/what-is-the-warp-allocation-granularity-and-what-purpose-does-it-serve-in-cud. My understanding is that if a kernel has N warps, it would be rounded up to the next multiple of WarpAllocationGranularity.

I tried to write code to test if N is rounded up to a multiple of WarpAllocationGranularity.

The block size of my kernel is 96 (32×3), with a total of 11 thread blocks; essentially, it has 33 warps. The kernel ran on an Nvidia Tesla C2075 (SM 2.0) with 14 multiprocessors. A print statement inside the kernel prints out the warp_id. It turns out that the kernel only has 33 warps instead of 34, which is the next multiple of WarpAllocationGranularity (2 for SM 2.0).

Since the Tesla C2075 has 14 multiprocessors, I also tried launching 15 thread blocks. In this case, the kernel has 45 (3×15) warps. Similarly, it allocates 45 warps instead of 46.

P.S. The kernel is simply a vector addition; it does not exceed the register or shared memory limits.

I probably don’t understand this concept well; any input is welcome.

This is what I thought at the beginning as well, but it’s not the case. If you go into the Occupancy Calculator and check the equations used for the occupancy calculation, you’ll find that Warp Allocation Granularity isn’t used for sm_20 or any later device. So the resources are allocated per warp.

P.S. Even if Warp Allocation Granularity were used, I think you should have expected a different behavior from what you described. In that case, I think the warps in each block (and not the total number of warps) would have been rounded up to a multiple of Warp Allocation Granularity. But that’s just a pointless guess.
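For what it’s worth, here is what the two interpretations would predict for the C2075 experiment above. Note that either way this is resource accounting, not execution: the printed warp count would still be 33, since granularity affects what the hardware reserves, not how many warps run. A sketch:

```python
from math import ceil

def round_up(x, m):
    return ceil(x / m) * m

warps_per_block = 3   # block size 96 in the C2075 experiment
num_blocks = 11
wag = 2               # Warp Allocation Granularity listed for sm_20

# Interpretation tested above: round the TOTAL warp count.
total_rounded = round_up(warps_per_block * num_blocks, wag)      # 34

# Interpretation guessed in this post: round warps PER BLOCK, so each
# block would reserve resources for 4 warps while only 3 are resident.
per_block_rounded = round_up(warps_per_block, wag) * num_blocks  # 44

print(total_rounded, per_block_rounded)  # 34 44
```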