Stephen Jones mentions in a GTC talk (at 32:35) that the number of threads per CTA should always be at least 128 = 32 * 4, because the SM can issue instructions to up to 4 warps per cycle. I can't tell whether he's implying that the SM is constrained to have those 4 warps be part of the same CTA. Is that indeed a constraint, and if not, is there perhaps a preference for co-scheduling warps from the same CTA?
More generally, I’d like to understand scheduling better, and would love pointers to written references or talks. Thanks
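For concreteness, this is the kind of launch-configuration decision I'm asking about; the kernel and sizes below are placeholders of my own, not from the talk:

```cpp
#include <cuda_runtime.h>

// Placeholder kernel, just to give the question a concrete shape.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // The advice as I understand it: use at least 128 threads per CTA,
    // i.e. 4 warps of 32 threads, so that a single CTA alone could feed
    // all 4 warp schedulers of an SM (if that is indeed how it works).
    const int threadsPerBlock = 128;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```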
Which GTC talk, pertaining to which GPU architecture(s)? I doubt this applies universally to all GPU architectures currently supported by CUDA, but I don’t have a complete overview.
A warp scheduler in a modern GPU, such as Volta or newer, can choose from among any of the warps assigned to it when issuing instructions, on a cycle-by-cycle basis.
That means if the warp scheduler has warps assigned to it from 2 or more different CTAs (thread blocks), then indeed the warp scheduler could pick a warp (instruction) from one threadblock to schedule, and in the very next cycle pick a warp (instruction) from another threadblock.
Since Volta and newer do not have dual-issue-capable schedulers, that is the closest you can get to "co-scheduled" when considering only a single warp scheduler. If we consider multiple warp schedulers in the same SM, then it is also true that in a given clock cycle one warp scheduler could schedule an instruction from one CTA while, in the same cycle, another warp scheduler schedules an instruction from another CTA.
The statement about groups of 4 is referring to the idea that an SM may have up to 4 warp schedulers. If it has 4 warp schedulers, and your threadblock has, say, 2 warps, then it is guaranteed that half of your issue capacity goes unused if that block is the only "resident" of that SM. Of course you can make up for this by having more threadblocks deposited on each SM.
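To make that concrete, the occupancy API can report how many blocks of a given size actually become resident per SM. A minimal sketch, assuming a placeholder kernel (the real numbers depend on your kernel's register and shared-memory usage):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; occupancy results depend on the real kernel's
// register and shared-memory usage.
__global__ void smallBlockKernel(float *data) {
    data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
}

int main() {
    // With 64-thread (2-warp) blocks, more than one block must be
    // resident per SM before all 4 warp schedulers have work to issue.
    int blockSize = 64;
    int numBlocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocksPerSm, smallBlockKernel, blockSize, 0 /* dynamic smem */);
    printf("Resident blocks/SM at blockSize=%d: %d (= %d warps)\n",
           blockSize, numBlocksPerSm, numBlocksPerSm * blockSize / 32);
    return 0;
}
```

If the reported warp count is at least 4 (and the warps are spread across the schedulers), small blocks need not leave issue capacity idle.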
There is no requirement that in a single/given cycle, each of the 4 warp schedulers must choose a warp/instruction from the same CTA.
This sort of thing isn't documented at the CUDA C++ level; it is mostly an implementation detail. Therefore the places where you may find it discussed are forum posts like this one, GTC talks, microbenchmarking papers, and perhaps architecture whitepapers for specific GPU arch families.
Each SM has 4 SM Partitions, each with a separate warp scheduler that can issue up to 1 instruction per cycle. Warps are assigned to the SM Partitions (in theory this assignment can change later in special circumstances, but that would take a performance toll, so assume warps stay in their partitions; an exception could be Dynamic Parallelism, i.e. invoking kernels from the device side). This assignment of warps to partitions is not constrained by which CTA a warp belongs to. Typically the assignment to partitions is balanced to achieve equal occupation.
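If you want to poke at this yourself, the PTX special registers %smid and %warpid can be read from device code. This is a diagnostic sketch only: the PTX ISA notes these registers are volatile (their values may change during execution, e.g. after preemption), and mapping a warp to a partition via %warpid % 4 is an inference from microbenchmarking literature, not documented behavior:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Read the PTX special registers %smid and %warpid. Per the PTX ISA,
// both are volatile: they reflect where the warp is at the moment of
// the read, and can change during execution.
__device__ unsigned smid()   { unsigned r; asm volatile("mov.u32 %0, %%smid;"   : "=r"(r)); return r; }
__device__ unsigned warpid() { unsigned r; asm volatile("mov.u32 %0, %%warpid;" : "=r"(r)); return r; }

__global__ void whereAmI() {
    if (threadIdx.x % 32 == 0) {  // one thread per warp reports
        // ASSUMPTION from microbenchmarking papers, not documented:
        // a warp's partition corresponds to %warpid % 4.
        printf("block %d warp %d: SM %u, warp slot %u (partition %u?)\n",
               blockIdx.x, threadIdx.x / 32, smid(), warpid(), warpid() % 4);
    }
}

int main() {
    whereAmI<<<2, 128>>>();  // 2 blocks of 4 warps each
    cudaDeviceSynchronize();
    return 0;
}
```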
Why would it take a performance toll to move the warp to another partition? Maybe there is some state maintained on the hardware that would need to be moved? Thanks
Some resources, like the registers, are specific to an SM Partition. Also, the execution units (e.g. FP32) are pipelines. To move a warp to another partition, the pipelines would have to be drained and all the registers moved. That would theoretically take from a few hundred up to a few thousand cycles. Nvidia does not give any guarantee that a warp stays on the same SM Partition, but in practice warps do.