Understanding warp scheduling on a streaming multiprocessor

I have a question about warp scheduling on a streaming multiprocessor (SM).
Let's say we have 4 blocks, each divided into 2 warps, and an SM that has 4 processing blocks. Are the warps assigned to the processing blocks right at the start, 2 warps per processing block, or does whichever processing block completes execution take the next warp that is ready?
So can there be a case where 3 warps are executed by one processing block, 2 warps each by two processing blocks, and 1 warp by one processing block, all inside the same SM?

This explanation by Greg may help.


Warps are statically assigned to the SMSPs (SM sub-partitions, the “processing blocks”) that make up an SM, at the point at which the threadblock is deposited on the SM by the CWD (Compute Work Distributor)/block scheduler.

A warp can only be issued by the warp scheduler associated with its SMSP.

It’s possible that one SMSP in an SM has, say, 4 warps assigned, all of which are stalled, and therefore has nothing to issue in a particular cycle, while in the same cycle, in the same SM, the warp scheduler of a different SMSP has multiple “eligible” warps and is only able to issue one of them.
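To make that single-cycle picture concrete, here is a minimal toy model in Python. It is only a sketch: the function name `issue_one_cycle`, the data layout, and the "pick the first eligible warp" rule are assumptions made for illustration, not the real selection policy of the hardware. The only behavior it takes from the description above is that each SMSP's scheduler can issue at most one of its own warps per cycle and cannot borrow warps from another SMSP.

```python
def issue_one_cycle(smsp_warps):
    """Toy model of one issue cycle on one SM.

    smsp_warps: one list per SMSP, each holding (warp_id, eligible) pairs
    for the warps statically assigned to that SMSP.
    Returns, per SMSP, the warp id issued this cycle, or None.
    """
    issued = []
    for warps in smsp_warps:
        eligible = [warp_id for warp_id, ok in warps if ok]
        # At most one warp issues per scheduler per cycle. Taking the first
        # eligible warp is an arbitrary stand-in for the real policy.
        issued.append(eligible[0] if eligible else None)
    return issued


# SMSP 0: four warps, all stalled. SMSP 1: three warps, all eligible.
# SMSPs 2 and 3 have no warps assigned in this example.
smsp_warps = [
    [(0, False), (1, False), (2, False), (3, False)],
    [(4, True), (5, True), (6, True)],
    [],
    [],
]
print(issue_one_cycle(smsp_warps))  # [None, 4, None, None]
```

SMSP 0 issues nothing even though SMSP 1 has two more eligible warps waiting, because in this model (as in the description above) a warp can only be issued by the scheduler of the SMSP it was assigned to.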


If the SM is empty at the time of kernel invocation and the 4 blocks with 8 warps fit on the SM, they will probably be evenly distributed onto the SM Partitions.

From a performance viewpoint that should be true.

From a correctness viewpoint I would not rely on it.

The 2 warps of a block will typically be distributed to different SM partitions to better balance the load (instead of each SM partition receiving both warps of one block). Architecture-wise, the shared resources of a block (e.g. shared memory) are also shared between the SM partitions.

If the 4 blocks with 8 warps don’t fit on the SM at the same time, then the blocks are scheduled in waves. Whichever SM partition has free resources will take up warps. AFAIK this is not predetermined.

Example (assuming only 2 blocks fit on the SM at once): 2 blocks with 4 warps (2 warps per block) are started/scheduled on the 4 SM partitions. The block on SM partitions 0 and 1 finishes first, and a new block is scheduled there. That block finishes as well, before the block on SM partitions 2 and 3, so the 4th block is also scheduled on partitions 0 and 1.

Now SM partitions 0 and 1 have executed 3 warps each, and SM partitions 2 and 3 have executed 1 warp each.
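The tally in this example can be reproduced with a small Python simulation. Treat it as a sketch under loudly stated assumptions: the placement rule (put each warp of a new block on the partitions with the fewest resident warps), the limit of 2 resident blocks, and the made-up block durations are all hypothetical choices to mirror the example. The real placement policy is not documented, as noted above.

```python
import heapq


def simulate(block_durations, num_partitions=4, warps_per_block=2,
             max_resident_blocks=2):
    """Toy wave scheduling on one SM; returns warps hosted per partition.

    Assumptions (illustrative only): all warps of a block run for the same
    duration, and a new block's warps go to the partitions that currently
    hold the fewest resident warps (ties go to the lowest index).
    """
    resident = [0] * num_partitions  # warps currently resident per partition
    executed = [0] * num_partitions  # total warps each partition has hosted
    running = []                     # min-heap of (finish_time, block_id, partitions)
    now = 0
    for block_id, duration in enumerate(block_durations):
        if len(running) == max_resident_blocks:
            # SM is full: wait for the earliest-finishing block to retire.
            now, _, freed = heapq.heappop(running)
            for p in freed:
                resident[p] -= 1
        order = sorted(range(num_partitions), key=lambda p: (resident[p], p))
        chosen = order[:warps_per_block]
        for p in chosen:
            resident[p] += 1
            executed[p] += 1
        heapq.heappush(running, (now + duration, block_id, tuple(chosen)))
    return executed


# Block 1 runs much longer than the others, so partitions 2 and 3 stay busy
# while blocks 2 and 3 both land on partitions 0 and 1.
print(simulate([1, 10, 1, 1]))  # [3, 3, 1, 1]
```

With equal durations, e.g. `simulate([1, 1, 1, 1])`, the same model gives `[2, 2, 2, 2]`, which shows that the uneven split comes from run-time differences between blocks rather than from anything fixed in advance.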

So your second understanding is correct.
