Question about threads per block and warps per SM

I’m having a hard time understanding how and why the number of threads per block affects the number of warps per SM.

The way I understand it, blocks are assigned to a single SM, with potentially multiple blocks per SM. The threads in each block are then broken down into 32-thread warps to be executed on the SM. The maximum number of threads and blocks that can be on an SM will be limited by local (register) and shared memory use.

I understand this to mean that the number of warps that can be scheduled on an SM will depend on the number of blocks executing and the number of threads per block. Increasing both will increase the number of warps that can be on an SM until some limiting factor is reached. Increasing the number of threads beyond this point would limit the number of blocks that can be executed but it would allow more warps from a single block to execute concurrently.

However, when I look at the data given by the profiler I see that the warps per SM follow a sort of sawtooth pattern depending on the number of threads per block (like the image below). I don’t understand why this happens. In particular, I don’t understand why there are sudden drops in the number of warps per SM at various points. I would assume that the warps are either filled in sequential order with the last warp getting the leftover threads (if there are 70 threads, 2 warps of 32 and 1 of 6), or that warps are filled out evenly (if there are 70 threads, 2 warps of 23 and 1 warp of 24). I would assume the warps are filled in the first manner. That would imply that, other than some slack in the last warp, the SM would have near 100% occupancy as you add more threads per block.

Can anyone explain where my reasoning is wrong? I am really confused by this.

Hmm, dropbox image linking doesn’t seem to be working.

Suppose I have 1024 threads per block. That is 32 warps. Most recent GPUs (excepting Turing) allow a hardware limit of 64 warps per SM, as well as 2048 threads per SM (these are consistent).

Ignoring other possible limiters, I could schedule up to 2 of these blocks on such an SM. That would give a complement of 64 warps (==2048 threads), a full load.

Now suppose I reduce my threadblock size to 992. That is 31 warps. I can still schedule at most 2 of these per SM (you cannot schedule 3; that would be 93 warps, or over 2900 threads). However, since I can schedule at most 2 of these, and each consists of 31 warps, the maximum warp load I can have is 62 warps - not 64.

If I further reduce my threadblock size, the maximum achievable warp load will continue to decrease, until such point at which scheduling 3 blocks becomes feasible. Then the maximum achievable warp load will “jump up” again. This process can repeat, as you continue to decrease threadblock size. The repetitions give rise to the “sawtooth” pattern.
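
To make the arithmetic concrete, here is a minimal sketch (plain host C++, assuming only the 2048-threads / 64-warps per SM limits discussed above, and ignoring register and shared memory limiters) that reproduces the sawtooth:

```cpp
#include <algorithm>
#include <cstdio>

int main()
{
    const int maxThreadsPerSM = 2048;  // assumed hardware limit from the discussion above
    const int maxWarpsPerSM   = 64;    // 2048 / 32
    const int warpSize        = 32;

    for (int threadsPerBlock = 64; threadsPerBlock <= 1024; threadsPerBlock += 32) {
        int warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize; // blocks occupy whole warps
        // a block is scheduled all-or-none, so only whole blocks count
        int blocksPerSM = std::min(maxThreadsPerSM / threadsPerBlock,
                                   maxWarpsPerSM / warpsPerBlock);
        printf("%4d threads/block -> %d blocks/SM, %2d warps/SM\n",
               threadsPerBlock, blocksPerSM, blocksPerSM * warpsPerBlock);
    }
    return 0;
}
```

At 992 threads per block this prints 62 warps/SM; dropping to 672 threads per block allows a third block and the count jumps back up to 63.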

3 Likes

Thanks for that.

I had to read through the explanation a few times while looking over the diagrams and the specs but I think I understand it now.

I always looked at it from the perspective of increasing threadblock size and didn’t realize that a threadblock couldn’t be partially scheduled on an SM (rather, a threadblock is all or none on an SM). Now it makes a lot more sense.

Yes, a threadblock is all or none, with respect to being scheduled on a SM by the block scheduler.

1 Like

What does it mean to schedule blocks per SM?
I know about concurrency within a threadblock, which means that 32 warps are toggled by the SM.
Is it possible that the SM performs context switching between two threadblocks?
Or did you mean to say that two threadblocks are scheduled to an SM queue?

The GPU block scheduler may deposit multiple blocks on a single SM.

Multiple blocks can be resident on a SM, and the SM warp scheduler can choose, in any given clock cycle, among warps that belong to different threadblocks.

How else would we get to 64 warps per SM, the published hardware limit? (except Turing)
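
As an aside, the per-SM limits mentioned here can be read directly from the device properties; a minimal sketch:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0
    printf("max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("warp size:          %d\n", prop.warpSize);
    printf("max warps per SM:   %d\n", prop.maxThreadsPerMultiProcessor / prop.warpSize);
    return 0;
}
```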

1 Like

Understood, thank you.

Hi, what is meant by the sentence “may deposit multiple blocks on a single SM”? Are these small code instructions that are queued in a GPU queue? The term “blocks” here is confusing.

1 Like

If you’ve never done any CUDA programming at all, that may not be very clear. It will be clear to a CUDA programmer. When you launch a CUDA kernel, you define a total number of threads that will execute the kernel. However, this total number of threads (the grid) is defined hierarchically, using two numbers: the first specifying blocks per grid, and the second specifying threads per block. It is these blocks that are being referred to.
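
For example (a minimal sketch, with a hypothetical kernel and arbitrary sizes), the two numbers appear directly in the launch configuration:

```cpp
#include <cstdio>

__global__ void myKernel()          // hypothetical kernel, just for illustration
{
    // a thread's global index is built from the block/thread hierarchy
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx == 0) printf("hello from thread 0\n");
}

int main()
{
    int blocksPerGrid   = 160;      // first number: blocks per grid
    int threadsPerBlock = 256;      // second number: threads per block
    myKernel<<<blocksPerGrid, threadsPerBlock>>>();   // 160 * 256 = 40960 threads total
    cudaDeviceSynchronize();
    return 0;
}
```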

You may wish to get an orderly introduction to CUDA programming here.

2 Likes

Yes, partial scheduling of a threadblock is not possible.

A threadblock needs resources (number of threads, number of registers) for full warps. Even if a threadblock has 33 threads, it needs the resources of 64 threads - 2 full warps.

Also, there are only certain fixed numbers of registers that a thread can be assigned - typically numbers divisible by 4 (except for the maximum of 255, presumably to leave space for RZ, the zero register).

This all leads to suboptimal resource usage for some block sizes and thus a sawtooth pattern.
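
If you want to see the net effect of these limiters for a particular kernel, the CUDA runtime offers an occupancy query; a minimal sketch (hypothetical empty kernel, block size of 992 chosen to match the earlier example):

```cpp
#include <cstdio>

__global__ void myKernel() { }      // hypothetical kernel

int main()
{
    int threadsPerBlock = 992;      // block size to test
    int blocksPerSM = 0;
    // asks the runtime how many blocks of this kernel, at this block size, can be
    // resident on one SM given the warp, register and shared memory limits
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  threadsPerBlock, 0 /* dynamic smem */);
    int warpsPerBlock = (threadsPerBlock + 31) / 32;
    printf("%d blocks/SM -> %d resident warps/SM\n",
           blocksPerSM, blocksPerSM * warpsPerBlock);
    return 0;
}
```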

1 Like

Thanks for the quick reply. You are a pal. I am halfway through teaching a course on CUDA and was trying to find out how the GPU runtime command processor works. I have found a slide on AMD’s command processor.

I was wondering if I can find a similar slide for NVIDIA.

I meant, is this GPU block scheduler the one who decides what instruction(s) to assign (or deposit, as you said) to multiple blocks? Thanks. I completely understand blocks, warps, half-warps, shared memory, etc.

This may be of interest for background.

Before a kernel has been launched, a thread block is largely just an idea. It consists of the code each thread will execute, along with the size of the block (number of threads per block).

When the kernel is launched in host code (at the point of <<<...>>>), then the host code library (libcudart) will make that kernel idea/definition available in a queue for scheduling on the device.

Once all ordering requirements (CUDA stream ordering, for example) have been met, the kernel will be transferred to another queue. It is largely still an idea or definition at this point. This queue is the input queue to the block scheduler.
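
As an illustration of such an ordering requirement (a minimal sketch with hypothetical kernels): two kernels launched into the same stream must run in order, while a kernel in another stream has no such dependency:

```cpp
#include <cuda_runtime.h>

__global__ void kernelA() { }       // hypothetical kernels
__global__ void kernelB() { }

int main()
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    kernelA<<<80, 256, 0, s1>>>();  // stream ordering: must complete before...
    kernelB<<<80, 256, 0, s1>>>();  // ...this launch becomes eligible for scheduling

    kernelA<<<80, 256, 0, s2>>>();  // different stream: no ordering against the work in s1

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```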

The block scheduler is a hardware entity that looks at items in the aforementioned queue, and decides when and where to deposit blocks from those kernel “definitions”, onto specific SMs in the device. When all the blocks from a kernel “definition” have been deposited, the kernel “definition” is removed from the queue.

When a block is deposited on a SM, a number of things happen. Among them are reservations for the various resources that the block will require. These resources include warp slots, registers, and shared memory, amongst others. The reservation of warp slots on a modern GPU is static, amongst the warp schedulers. If a SM has 4 warp schedulers, the warps from a newly deposited block will be statically allocated amongst the 4 warp schedulers. If there is no other activity on the SM at that point, we would presume that the warps would be evenly divided. Static here means that warp ownership does not move from one warp scheduler to another warp scheduler during the lifetime of the warp.

Once a warp slot is occupied by a particular warp, then the warp scheduler is free to select the next instruction from that warp, and issue it to functional unit resources in that SM (more specifically, in that SMSP - the SM sub-partition associated with that warp scheduler). This “last” phase of scheduling activity is covered in some detail in unit 3 of this course. I encourage you to review that.

The compiler determines what instructions are part of the kernel’s thread code. For this specific part of the discussion, a block is nothing more than a set of threads. The block doesn’t have any instructions, nor any instruction ordering, that is in any way separate from the thread instructions and the thread instruction ordering that is pre-determined by the compiler. The job of the block scheduler is to “deposit” a block from a kernel in the incoming “ready kernel queue”, onto a particular SM. The block scheduler makes no decisions about which instruction to schedule at any point. The warp scheduler is the one that schedules (i.e. issues) an instruction, from a warp, onto execution resources that can handle that instruction type.

A lot of this information is not formally documented by NVIDIA, and much of it is unspecified. I’m not likely to proceed much further in any sort of detailed description of the steps I outlined above. Any of this is subject to change in the future.

1 Like

I completely understand now and your explanation has cleared a lot of “mist”. Thank you.