Stephen Jones mentions in a GTC talk (at 32:35) that the number of threads per CTA should always be at least 128 = 32 * 4, because the SM can issue instructions to up to 4 warps per cycle. I can't tell whether he's implying that the SM is constrained to have those 4 warps be part of the same CTA. Is that indeed a constraint, and if not, is there perhaps a preference for co-scheduling warps from the same CTA?
More generally, I’d like to understand scheduling better, and would love pointers to written references or talks. Thanks
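For concreteness, this is the kind of launch-configuration decision I'm asking about; the kernel and sizes below are placeholders of my own, not from the talk:

```cpp
#include <cuda_runtime.h>

// Placeholder kernel, just to give the question a concrete shape.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // The advice as I understand it: use at least 128 threads per CTA,
    // i.e. 4 warps of 32 threads, so that a single CTA alone could feed
    // all 4 warp schedulers of an SM (if that is indeed how it works).
    const int threadsPerBlock = 128;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```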
Which GTC talk, pertaining to which GPU architecture(s)? I doubt this applies universally to all GPU architectures currently supported by CUDA, but I don’t have a complete overview.
A warp scheduler in a modern GPU, such as Volta or newer, can choose from among any of the warps assigned to it when issuing instructions, on a cycle-by-cycle basis.
That means if the warp scheduler has warps assigned to it from 2 or more different CTAs (thread blocks), then indeed the warp scheduler could pick a warp (instruction) from one threadblock to schedule, and in the very next cycle pick a warp (instruction) from another threadblock.
Since Volta and newer do not have dual-issue-capable schedulers, that is the closest you can get to "co-scheduled" when considering only a single warp scheduler. If we consider multiple warp schedulers in the same SM, then it is also true that in a given clock cycle one warp scheduler could schedule an instruction from one CTA while, in the same cycle, another warp scheduler schedules an instruction from another CTA.
The statement about groups of 4 is referring to the idea that an SM may have up to 4 warp schedulers. If it has 4 warp schedulers, and your threadblock has, say, 2 warps, then it is guaranteed that half of your issue capacity goes unused if that block is the only "resident" of that SM. Of course you can make up for this by having more threadblocks deposited on each SM.
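To make that concrete, the occupancy API can report how many blocks of a given size actually become resident per SM. A minimal sketch, assuming a placeholder kernel (the real numbers depend on your kernel's register and shared-memory usage):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; occupancy results depend on the real kernel's
// register and shared-memory usage.
__global__ void smallBlockKernel(float *data) {
    data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
}

int main() {
    // With 64-thread (2-warp) blocks, more than one block must be
    // resident per SM before all 4 warp schedulers have work to issue.
    int blockSize = 64;
    int numBlocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocksPerSm, smallBlockKernel, blockSize, 0 /* dynamic smem */);
    printf("Resident blocks/SM at blockSize=%d: %d (= %d warps)\n",
           blockSize, numBlocksPerSm, numBlocksPerSm * blockSize / 32);
    return 0;
}
```

If the reported warp count is at least 4 (and the warps are spread across the schedulers), small blocks need not leave issue capacity idle.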
There is no requirement that in a single/given cycle, each of the 4 warp schedulers must choose a warp/instruction from the same CTA.
This sort of thing isn't documented at the CUDA C++ level; it is mostly an implementation detail. Therefore the places where you may find it discussed are forum posts like this one, GTC talks, microbenchmarking papers, and perhaps architecture whitepapers for specific GPU arch families.
Each SM has 4 SM Partitions, each with a separate warp scheduler that can issue up to 1 instruction per cycle. Warps are assigned to the SM Partitions (in theory this assignment can change later in special circumstances, but that would take a performance toll, so assume warps stay in their partitions; an exception could be Dynamic Parallelism, i.e. invoking kernels from the device side). This assignment of warps to partitions is not constrained by which CTA a warp belongs to. Typically the assignment to partitions is balanced to achieve equal occupation.
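If you want to poke at this yourself, the PTX special registers %smid and %warpid can be read from device code. This is a diagnostic sketch only: the PTX ISA notes these registers are volatile (their values may change during execution, e.g. after preemption), and mapping a warp to a partition via %warpid % 4 is an inference from microbenchmarking literature, not documented behavior:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Read the PTX special registers %smid and %warpid. Per the PTX ISA,
// both are volatile: they reflect where the warp is at the moment of
// the read, and can change during execution.
__device__ unsigned smid()   { unsigned r; asm volatile("mov.u32 %0, %%smid;"   : "=r"(r)); return r; }
__device__ unsigned warpid() { unsigned r; asm volatile("mov.u32 %0, %%warpid;" : "=r"(r)); return r; }

__global__ void whereAmI() {
    if (threadIdx.x % 32 == 0) {  // one thread per warp reports
        // ASSUMPTION from microbenchmarking papers, not documented:
        // a warp's partition corresponds to %warpid % 4.
        printf("block %d warp %d: SM %u, warp slot %u (partition %u?)\n",
               blockIdx.x, threadIdx.x / 32, smid(), warpid(), warpid() % 4);
    }
}

int main() {
    whereAmI<<<2, 128>>>();  // 2 blocks of 4 warps each
    cudaDeviceSynchronize();
    return 0;
}
```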
Why would it take a performance toll to move the warp to another partition? Maybe there is some state maintained on the hardware that would need to be moved? Thanks
Some resources, like the registers, are specific to an SM Partition. Also, the execution units (e.g. FP32) are pipelines. To move a warp to another partition, the pipelines would have to be drained and all the registers moved. That would theoretically take from a few hundred up to a few thousand cycles. Nvidia does not give any guarantee that a warp stays on the same SM Partition, but in practice warps do.