Why is my register count limiting the active thread blocks per SM?

This CUDA Occupancy Calculator tells me that only one thread block can be active per multiprocessor with the following settings:
Compute Capability: 8.6
CUDA version: 11.1
Threads per block: 672
Registers per thread: 48
Shared memory per block: 256

I don’t understand why you couldn’t have two active thread blocks per SM since 672 * 2 * 48 = 64512, which is less than the total number of registers per SM (65536), and it’s also less than the max registers per block (65536). Those maximums are reported on that same Occupancy Calculator website.

According to the official occupancy calculator in Nsight Compute, CC 8.6 has the following properties regarding registers:

Register Allocation Unit Size: 256
Register Allocation Granularity: warp
Warp Allocation Granularity: 4

My interpretation is that the actual number of warps per block (21 = 672/32) is rounded up to the next multiple of 4, which is 24. 24 warps at 48 registers per thread require 24 * 32 * 48 = 36864 registers, which is more than half of the 65536 registers available per SM. Two blocks would need 73728 registers, so only 1 thread block fits on an SM.
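
To make the arithmetic easy to check, here is a minimal sketch of that interpretation in Python. The function name and parameter names are my own (hypothetical), the default values are the ones quoted in this thread for CC 8.6, and the rounding rule is my reading of the granularity settings rather than the calculator's confirmed formula. It only considers the register limit and ignores shared memory and the thread/warp limits.

```python
def blocks_per_sm_by_registers(threads_per_block, regs_per_thread,
                               regs_per_sm=65536, warp_size=32,
                               warp_alloc_granularity=4, reg_alloc_unit=256):
    """Estimate the register-limited number of blocks per SM.

    Assumption: warps per block are rounded up to the warp allocation
    granularity, and registers per warp are rounded up to the register
    allocation unit size.
    """
    # Warps needed by one block (ceiling division): 672 / 32 = 21
    warps = -(-threads_per_block // warp_size)
    # Round up to the warp allocation granularity: 21 -> 24
    warps_alloc = -(-warps // warp_alloc_granularity) * warp_alloc_granularity
    # Registers per warp, rounded up to the allocation unit: 48 * 32 = 1536
    regs_per_warp = -(-(regs_per_thread * warp_size) // reg_alloc_unit) * reg_alloc_unit
    regs_per_block = warps_alloc * regs_per_warp
    return warps_alloc, regs_per_block, regs_per_sm // regs_per_block


# Naive estimate from the question, without any rounding
naive_regs_two_blocks = 672 * 2 * 48

warps_alloc, regs_per_block, blocks = blocks_per_sm_by_registers(672, 48)
print(naive_regs_two_blocks, warps_alloc, regs_per_block, blocks)
```

With the settings from the question, the naive estimate for two blocks is 64512 registers, but the rounded per-block figure is 36864, so two blocks no longer fit in 65536.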
