I’m having a hard time understanding how and why the number of threads per block affects the number of warps per SM.
The way I understand it, blocks are assigned to a single SM, with potentially multiple blocks per SM. The threads in each block are then broken down into 32-thread warps to be executed on the SM. The maximum number of threads and blocks that can be resident on an SM is limited by register and shared memory use.
I understand this to mean that the number of warps that can be scheduled on an SM will depend on the number of blocks executing and the number of threads per block. Increasing both will increase the number of warps that can be on an SM until some limiting factor is reached. Increasing the number of threads beyond this point would limit the number of blocks that can be executed but it would allow more warps from a single block to execute concurrently.
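To put that reasoning into numbers, here is a minimal sketch of the model, assuming blocks are placed on an SM only as whole units. The function name `residentWarpsPerSM` and both limit parameters are hypothetical, and register and shared memory limits are deliberately left out.

```cpp
#include <algorithm>

// Hypothetical helper: number of warps resident on one SM for a given block size.
// Assumes only two limits apply (max warps per SM and max blocks per SM);
// register and shared memory limits are ignored in this sketch.
// threadsPerBlock must be greater than zero.
int residentWarpsPerSM(int threadsPerBlock, int maxWarpsPerSM, int maxBlocksPerSM) {
    const int warpSize = 32;

    // A partially filled last warp still counts as one whole warp.
    int warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize;

    // Only whole blocks can be resident, so round down.
    int blocksThatFit = std::min(maxWarpsPerSM / warpsPerBlock, maxBlocksPerSM);

    return blocksThatFit * warpsPerBlock;
}
```

With illustrative limits of 48 warps and 8 blocks per SM, this gives 48 warps at 192 threads per block, 42 at 224, and 48 again at 256. Under these assumptions the count drops every time one more warp per block means one fewer whole block fits.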
However, when I look at the data given by the profiler, I see that the warps per SM follow a sort of sawtooth pattern depending on the number of threads per block (like the image below). I don’t understand why this happens. In particular, I don’t understand why there are sudden drops in the number of warps per SM at various points. I would assume that the warps are either filled in sequential order, with the last warp getting the leftover threads (if there are 70 threads, 2 warps of 32 and 1 of 6), or that warps are filled out evenly (if there are 70 threads, 2 warps of 23 and 1 warp of 24). I would assume the warps are filled in the first manner. That would imply that, other than some slack in the last warp, the SM would have near 100% occupancy as you add more threads per block.
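The first filling scheme can be sketched as below, assuming warps are formed from consecutive thread IDs and only the last one may be partial. The function name `warpSizes` is hypothetical and is used here for illustration only.

```cpp
#include <vector>

// Hypothetical helper: split a block into warps in sequential order.
// Every warp is full except possibly the last, which gets the leftover threads.
std::vector<int> warpSizes(int threadsPerBlock) {
    const int warpSize = 32;
    std::vector<int> sizes;

    int remaining = threadsPerBlock;
    while (remaining > 0) {
        int active = remaining < warpSize ? remaining : warpSize;
        sizes.push_back(active);  // number of active threads in this warp
        remaining -= active;
    }
    return sizes;
}
```

For a 70-thread block this returns three entries: 32, 32, and 6. The last warp has only 6 active threads, yet it still counts as a full warp against any per-SM warp limit.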
Can anyone explain where my reasoning is wrong? I am really confused by this.
Hmm, Dropbox image linking doesn’t seem to be working.