Question about warp execution and the warp scheduler

Hi!

I’m new to GPU architectures and to CUDA / parallel programming in general, so please excuse my question if it’s too basic for this sub.

For the context of my question, I’ll use the Blackwell architecture whitepaper (available here https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf). Figure 5 on page 11 shows the Blackwell Streaming Multiprocessor (SM) architecture diagram.

I understand that warps are the units of thread scheduling; in the Blackwell architecture they consist of 32 threads. I couldn’t find that information in the Blackwell whitepaper, but it is mentioned in “7.1 SIMT Architecture” in the latest CUDA C++ Programming Guide:

The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps.

We also learn about individual threads composing a warp:

Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently.

And we learn about Independent Thread Scheduling:

Starting with the NVIDIA Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp. With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. A schedule optimizer determines how to group active threads from the same warp together into SIMT units. This retains the high throughput of SIMT execution as in prior NVIDIA GPUs, but with much more flexibility: threads can now diverge and reconverge at sub-warp granularity.

My question stems from having a hard time reconciling the SIMT execution model of the warp with Independent Thread Scheduling. It’s easiest to picture when there is warp divergence: there you can see two “sub-warps”, or SIMT units, each executing a single instruction on a different group of threads, one per execution path. But I’m having a hard time understanding it outside of that context.
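To make the divergence case concrete, here’s the kind of toy kernel I’m picturing (my own sketch, names made up):

```cpp
// Toy divergence sketch: within one warp, even and odd lanes take different
// paths, so I picture two "sub-warps" that each execute one path.
__global__ void divergeKernel(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x & 1) == 0) {
        out[tid] = 1.0f;   // path A: even lanes of the warp
    } else {
        out[tid] = 2.0f;   // path B: odd lanes of the warp
    }
}
```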

Let’s say I have a kernel that performs an FP32 addition operation. When the kernel is launched, blocks are assigned to SMs, blocks are further divided into warps, and these warps are assigned to the 4 warp schedulers that are available per SM.
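For concreteness, the kernel I have in mind is just an elementwise FP32 add, something like this (my own sketch):

```cpp
// Toy FP32 addition kernel (my own sketch): one thread per output element.
__global__ void addKernel(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}
```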

In the case of the Blackwell SM, there are 128 CUDA cores. In the figure we see that they’re distributed over 4 groups (each with an L0 cache, warp scheduler, and dispatch unit), but that doesn’t matter; what matters are the 128 CUDA cores (plus the 4 Tensor Cores, registers, etc.), though for my toy example I think we can forget about everything but the CUDA cores.

If all resources are occupied, a warp will be scheduled for execution once resources become available. But what does it mean for resources to be available, or for a warp to be ready for execution, in this context? Does it mean that at least 1 CUDA core is available, because the scheduler can now schedule threads independently? Or that N < 32 CUDA cores are available, depending on some kind of performance heuristic it knows about?

I think my question is: does Independent Thread Scheduling mean that the scheduler can use all the available resources at any given time, picking them up as they become available, plus some optimizations such as, in the case of warp divergence, being able to execute different instructions even though the warp itself is Single Instruction (i.e., not having to do 2 “loops” over the warp just to execute two different paths)? Or does it mean something else? If it’s exactly that, did the schedulers prior to Volta require exactly 32 CUDA cores to be available (in this toy example, not in the general case where there is memory contention etc.)?

Thank you a lot!

Typically, for most instructions, a ‘sub-warp’ needs as many resources as if it were a full warp. So it is not possible to diverge the threads of a warp and keep the parts running at the same time (with instructions scheduled in the same cycle). Each ‘sub-warp’ continues to run, but time-interleaved with the others. Only one instruction from any warp can be scheduled per cycle per SMP (one of the four SM Partitions per SM).

Independent Thread Scheduling helps to avoid some cases of deadlock: if one sub-warp blocks, e.g. waiting for some condition, the others can continue to run. It is less useful for trivially improving performance.
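A minimal sketch of the kind of case meant here (a toy example, with no claims about scheduling fairness): one lane produces a value that the other lanes of the same warp spin-wait for. With strictly lock-step pre-Volta execution this pattern could hang, because the spinning path could be selected to run before (and instead of) the producing path; with Independent Thread Scheduling the sub-warps can be time-interleaved so the producer makes progress.

```cpp
// Intra-warp producer/consumer sketch (toy example; assumes blockDim.x <= 32
// so all threads are in one warp, and *flag initialized to 0 on the host).
__global__ void flagKernel(volatile int *flag, int *out)
{
    if (threadIdx.x == 0) {
        *flag = 1;                  // producer lane publishes the flag
    } else {
        while (*flag == 0) { }      // consumer lanes busy-wait on the flag
    }
    out[threadIdx.x] = threadIdx.x; // all lanes get here once the flag is set
}
```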

It is also not possible to use the arithmetic units (‘CUDA cores’) for just any lane (thread within a warp). They are often connected to only 2 lanes and can read from or write to the register file of only those threads.


Thanks a lot for your reply. So even if a warp diverges, the execution resources are allocated as if the full warp were executing, and Independent Thread Scheduling doesn’t really eliminate the performance penalty of warp divergence (or maybe it reduces it in ways I’m not aware of), but it makes thread execution more flexible since threads are now somewhat independent from the whole warp.

I also hadn’t realized that the arithmetic units are physically connected to specific lanes and their register files. It helps to know, thank you!

There are some ways the resource needs may be reduced, e.g. with memory operations: fewer active threads can need less than the full L2/global memory bandwidth that 32 threads would use (see the sketch at the end of this reply).

This seems (!) to be the case at least for the INT32 and FP32 units. But there are no published details from Nvidia.

That could be different for e.g. FP64 computations on non-datacenter GPUs. Those are serialized (and thus slow) and could possibly profit from, or at least have less of a penalty with, diverging warps.
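Coming back to the memory-operations point, here is a toy sketch (my own example, not from any NVIDIA document) of a partially active warp: the access in the branch is issued only for the active lanes, so it can generate less global-memory traffic than a full 32-lane access.

```cpp
// Partial-warp memory access sketch (toy example): only every 4th lane is
// active in the branch, so the global-memory access is issued for 8 lanes
// per warp rather than 32.
__global__ void partialLoad(const float *in, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x & 3) == 0) {
        out[tid] = in[tid] * 2.0f;  // memory traffic only for active lanes
    }
}
```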


Thanks a lot again for your comment, it’s very nice!
