My understanding is that if I issue N blocks and there are n<N SMs on my GPU, these N blocks will wait in a queue for available SMs.
So even if the workload of each block is quite different, the GPU's SMs stay busy as long as there are enough blocks (assuming threads don't stall).
Is this correct?
E.g., with 4 blocks of 32 threads each, where
block 0 does 32 additions,
block 1 does 64 additions,
block 2 does 64 additions,
block 3 does 32 additions,
and a GPU with 2 SMs, the workload across the SMs is still balanced even though the workloads of the blocks are not. Right?
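
To make the queue model in this example concrete, here is a rough simulation sketch in Python. It is not real GPU code and does not claim to reproduce the hardware scheduler. It assumes the model described in the question: blocks are dispatched in index order, each SM runs one block at a time, and a block's cost is simply its number of additions. On real hardware an SM can hold several resident blocks at once, so this is a simplification. The function name `simulate_block_queue` and the time units are made up for illustration.

```python
import heapq

def simulate_block_queue(block_work, num_sms):
    """Greedy dispatch: each queued block goes to the SM that frees up first."""
    # Min-heap of (time_when_free, sm_index); every SM starts idle at time 0.
    free_at = [(0, sm) for sm in range(num_sms)]
    heapq.heapify(free_at)

    busy = [0] * num_sms    # total work units executed on each SM
    finish = [0] * num_sms  # time at which each SM completes its last block
    assignment = []         # (block_index, sm_index, start_time)

    for block, work in enumerate(block_work):
        start, sm = heapq.heappop(free_at)   # earliest-free SM (ties: lowest index)
        end = start + work
        busy[sm] += work
        finish[sm] = end
        assignment.append((block, sm, start))
        heapq.heappush(free_at, (end, sm))

    return assignment, busy, finish

# The example from the question: 4 blocks, 2 SMs (assumed one block per SM at a time).
assignment, busy, finish = simulate_block_queue([32, 64, 64, 32], num_sms=2)
for block, sm, start in assignment:
    print(f"block {block} -> SM {sm}, starts at t={start}")
print("work per SM:", busy)
print("finish time per SM:", finish)
```

Under these assumptions, blocks 0 and 2 land on SM 0 and blocks 1 and 3 on SM 1, and both SMs do 96 units of work and finish at the same time. This only illustrates the queue model for this particular input. It says nothing about how a given GPU actually orders or places blocks.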