In the CUDA Occupancy Calculator there's a quantity called "Warp Allocation Granularity", which is 4 for CUDA compute capability 3.5. I tried to find out what it means, and found some answers indicating that the device only allocates resources in multiples of this quantity. However, both the CUDA Occupancy Calculator and my own experiments (on a K40c card) show that registers are allocated on a per-warp basis, not in multiples of 4 warps.
For instance, for a kernel with a block size of 64 (2 warps) and 74 registers per thread, each warp needs 74 × 32 = 2368 registers, which rounds up to 2560 with the 256-register allocation unit of CC 3.5. If the granularity assumption were true, each 2-warp block would be charged for 4 warps' worth of registers (10240), and the 65536-register file of each SM could host only 6 blocks at a time. But the CUDA Occupancy Calculator shows that each SM can run 12 blocks simultaneously, which is what per-warp allocation predicts (⌊65536 / 2560⌋ = 25 warps, i.e. 12 two-warp blocks).
I checked this on the device as well, by printing the smid for each block. Since the scheduler assigns blocks to SMs in a round-robin fashion as long as there are fewer blocks than the device can run simultaneously, all the SMs get the same number of blocks in that case. However, when there are more blocks than the device can run at the same time, not all SMs get the same number of blocks, because waiting blocks are assigned to an SM as soon as it has enough free resources to host one. My observations using this method were consistent with what the CUDA Occupancy Calculator shows. (This doesn't technically prove anything, though.)
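For reference, the smid experiment can be sketched roughly like this (`%smid` is a documented PTX special register giving the SM a thread is running on; the grid and block sizes here are placeholders, not the exact ones I used):

```cuda
#include <cstdio>

// Read the SM id via the %smid PTX special register. Its value is
// implementation-defined but stable for the lifetime of a block.
__device__ unsigned int get_smid(void) {
    unsigned int id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

__global__ void report_smid(void) {
    if (threadIdx.x == 0)
        printf("block %d -> SM %u\n", blockIdx.x, get_smid());
}

int main(void) {
    // Launch more blocks than the device can run at once, then count
    // how many blocks each SM reports.
    report_smid<<<256, 64>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Counting the printed lines per SM shows how many blocks each SM hosted over the run, which is how I compared against the calculator's numbers.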
Then what is "Warp Allocation Granularity" for in practice?