This CUDA Occupancy Calculator tells me that you can have only one active thread block per multiprocessor given the following settings:
Compute Capability: 8.6
CUDA version: 11.1
Threads per block: 672
Registers per thread: 48
Shared memory per block: 256 bytes
I don’t understand why two thread blocks can’t be active per SM. Two blocks need 672 * 2 * 48 = 64512 registers, which is less than the total number of registers per SM (65536) and also less than the maximum number of registers per block (65536). Both maximums are reported by that same Occupancy Calculator website.
My interpretation is that the actual number of warps per block (21 = 672/32) is rounded up to the next multiple of 4, which is 24. With 48 registers per thread, 24 warps require 24 * 32 * 48 = 36864 registers, which is more than half of the available registers (32768). So only one thread block fits on an SM.
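To make the two calculations concrete, here is a small Python sketch of the arithmetic. It only models the register limit, and the rounding to a multiple of 4 warps is my assumption about what the calculator does, not something I have confirmed. The function and constant names are made up for illustration.

```python
import math

WARP_SIZE = 32
REGS_PER_SM = 65536          # total registers per SM, as reported by the calculator
WARP_ALLOC_GRANULARITY = 4   # assumption: warps per block are rounded up to a multiple of 4


def blocks_naive(threads_per_block, regs_per_thread):
    """Blocks per SM if registers are counted per thread with no rounding."""
    return REGS_PER_SM // (threads_per_block * regs_per_thread)


def blocks_rounded(threads_per_block, regs_per_thread):
    """Blocks per SM if each block's warp count is rounded up first (my interpretation)."""
    warps = math.ceil(threads_per_block / WARP_SIZE)
    warps_rounded = math.ceil(warps / WARP_ALLOC_GRANULARITY) * WARP_ALLOC_GRANULARITY
    regs_per_block = warps_rounded * WARP_SIZE * regs_per_thread
    return REGS_PER_SM // regs_per_block


print(blocks_naive(672, 48))    # 2
print(blocks_rounded(672, 48))  # 1
```

If this interpretation is right, dropping to 640 threads per block (exactly 20 warps) would need no rounding, and two blocks would fit within the register limit.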