Question about threads per block and warps per SM

I’m having a hard time understanding how and why the number of threads per block affects the number of warps per SM.

The way I understand it, blocks are assigned to a single SM, with potentially multiple blocks per SM. The threads in each block are then broken down into 32-thread warps to be executed on the SM. The maximum number of threads and blocks that can be resident on an SM is limited by register and shared memory use.

I understand this to mean that the number of warps that can be scheduled on an SM depends on the number of blocks executing and the number of threads per block. Increasing either will increase the number of warps on an SM until some limiting factor is reached. Increasing the number of threads per block beyond this point would limit the number of blocks that can execute, but it would allow more warps from a single block to execute concurrently.

However, when I look at the data given by the profiler, I see that the warps per SM follow a sort of sawtooth pattern depending on the number of threads per block (like the image below). I don’t understand why this happens. In particular, I don’t understand how there are sudden drops in the number of warps per SM at various points. I would assume that the warps are either filled in sequential order with the last warp getting the leftover threads (if there are 70 threads, 2 warps of 32 and 1 of 6), or that the warps are filled out evenly (if there are 70 threads, 2 warps of 23 and 1 warp of 24). I would assume the warps are filled in the first manner. That would imply that, other than some slack in the last warp, the SM would have near 100% occupancy as you add more threads per block.

Can anyone explain where my reasoning is wrong? I am really confused by this.

Hmm, dropbox image linking doesn’t seem to be working.

Suppose I have 1024 threads per block. That is 32 warps. Most recent GPUs (excepting Turing) allow a hardware limit of 64 warps per SM, as well as 2048 threads per SM (these are consistent).

Ignoring other possible limiters, I could schedule up to 2 of these blocks on such an SM. That would give a complement of 64 warps (==2048 threads), a full load.

Now suppose I reduce my threadblock size to 992. That is 31 warps. I can still schedule at most 2 of these per SM (you cannot schedule 3; that would be 93 warps, or 2976 threads, exceeding both limits). However, since I can schedule at most 2 of these, and each consists of 31 warps, the maximum warp load I can have is 62 warps - not 64.

If I further reduce my threadblock size, the maximum achievable warp load will continue to decrease, until the point at which scheduling 3 blocks becomes feasible. Then the maximum achievable warp load will “jump up” again. This process can repeat as you continue to decrease threadblock size. The repetitions give rise to the “sawtooth” pattern.

Thanks for that.

I had to read through the explanation a few times while looking over the diagrams and the specs but I think I understand it now.

I always looked at it from increasing threadblock size and didn’t realize that a threadblock couldn’t be partially scheduled on an SM (rather a threadblock is all or none on an SM). Now it makes a lot more sense.

Yes, a threadblock is all or none, with respect to being scheduled on an SM by the block scheduler.

What does it mean to schedule blocks per SM?
I know about the concurrency within a threadblock, meaning that the SM toggles among its 32 warps.
Is it possible that the SM performs context switching between two threadblocks?
Or did you mean that two threadblocks are scheduled to an SM queue?

The GPU block scheduler may deposit multiple blocks on a single SM.

Multiple blocks can be resident on an SM, and the SM warp scheduler can choose, in any given clock cycle, among warps that belong to different threadblocks.

How else would we get to 64 warps per SM, the published hardware limit? (except Turing)

Understood, thank you.