The CUDA documentation doesn’t say very much at all about the scheduler design, so we have to do some detective work and see what fits the facts we DO know.
One reasonable scheduler design is that the order of warps is DYNAMIC… it’s a circular queue. Warps get removed from the list when they stall (usually waiting on a memory access), and when the memory is finally delivered, the warp gets put back into the active queue.
The alternative design would be for the scheduling order to be FIXED, like Anjul’s list in the post above, and the scheduler SKIPS inactive warps. That’s more work for the scheduler, since if one warp has to be skipped, the scheduler has to check if the NEXT warp is OK, and if THAT has to be skipped, it has to go to the NEXT warp, etc. And the scheduler has to do this FAST… effectively in less than one clock. So the variable “skip” length makes that zero-cost scheduling much harder to implement.
A dynamic active-warp list makes it easier for the scheduler (time-wise), since the only decision that needs to be made is whether to remove a warp from the active list or not… the next warp to issue is always just the next entry in the queue.
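Just to make the contrast concrete, here’s a toy host-side model of the two policies. This is pure speculation and purely illustrative (the function names and structure are my own invention, not anything NVIDIA has documented), but it shows why the dynamic list does constant work per pick while the fixed-order list has a variable-length skip scan:

```
// Toy host-side model of the two guessed policies; purely illustrative,
// all names are made up -- this is NOT how the hardware is documented to work.
#include <cstdio>
#include <deque>
#include <vector>

// Dynamic circular list: stalled warps were already unlinked when they issued
// their memory request, so picking the next warp is just "take the head".
static int issueDynamic(std::deque<int>& active)
{
    if (active.empty()) return -1;          // everything is waiting on memory
    int w = active.front();
    active.pop_front();
    active.push_back(w);                    // rotate: constant work per pick
    return w;
}

// Fixed order + skip: the scheduler walks a fixed ring and must test each
// warp's "ready" bit until it finds one that isn't stalled -- variable work.
static int issueFixed(int& cursor, const std::vector<bool>& stalled)
{
    int n = (int)stalled.size();
    for (int tries = 0; tries < n; ++tries) {
        int w = cursor;
        cursor = (cursor + 1) % n;
        if (!stalled[w]) return w;          // may have scanned many entries
    }
    return -1;
}

int main()
{
    std::vector<bool> stalled = { false, true, true, false };  // warps 1,2 wait on memory
    std::deque<int> active = { 0, 3 };      // dynamic list: stalled warps already removed
    int cursor = 0;

    printf("dynamic picks warp %d, fixed picks warp %d\n",
           issueDynamic(active), issueFixed(cursor, stalled));
    return 0;
}
```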
Now what other clues do we have to confirm the design?
The scheduler can only handle so many warps… 32 per SM in G200. That implies this limit is the maximum size of that circular buffer. If the order were fixed, there would likely be no scheduler warp limit at all. (You’d still be limited in practice by registers, of course.)
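You can check that per-SM warp limit with the runtime API on a reasonably recent toolkit (the maxThreadsPerMultiProcessor field isn’t available on the oldest toolkits; for G200 the occupancy calculator lists 1024 threads = 32 warps per SM):

```
// Sanity check of the per-SM warp limit via the runtime API.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: warpSize = %d, max resident threads/SM = %d -> %d warps/SM\n",
           prop.name, prop.warpSize, prop.maxThreadsPerMultiProcessor,
           prop.maxThreadsPerMultiProcessor / prop.warpSize);
    return 0;
}
```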
Another big clue: __syncthreads() is cheap. If the scheduler uses a circular list, syncthreads would basically just pop any warp that hits the barrier out of the active list; then, once the active list is empty (and no warps are still pending on memory), the active warp list gets repopulated and everything starts up again.
That’s cheap. With a fixed-order list, syncthreads involves marking each warp with a status flag, checking for all of the flags to be set (while skipping over the waiting warps every time), and then clearing those bits once all the warps have flagged in. Again, not expensive, but also not “only 4 clocks used” like the manual says.
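If you want to poke at that claim yourself, here’s a crude micro-benchmark of __syncthreads() using the on-chip clock() counter. Timing this way is noisy and the number includes the overhead of reading the counter, so treat it as a rough order of magnitude, not a precise figure:

```
// Crude micro-benchmark of __syncthreads() cost -- rough order of magnitude only.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void syncCost(unsigned int* elapsed)
{
    unsigned int start = (unsigned int)clock();
    #pragma unroll
    for (int i = 0; i < 100; ++i)
        __syncthreads();                 // 100 back-to-back barriers
    unsigned int stop = (unsigned int)clock();

    if (threadIdx.x == 0)
        *elapsed = stop - start;         // total clocks for 100 barriers, thread 0's view
}

int main()
{
    unsigned int* d_elapsed; unsigned int h_elapsed = 0;
    cudaMalloc(&d_elapsed, sizeof(unsigned int));
    syncCost<<<1, 256>>>(d_elapsed);     // one block, 8 warps, nothing else running
    cudaMemcpy(&h_elapsed, d_elapsed, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    printf("~%.1f clocks per __syncthreads() (very rough)\n", h_elapsed / 100.0);
    cudaFree(d_elapsed);
    return 0;
}
```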
So, supported by these clues, I think the GPU’s scheduler uses a variable-length, dynamic, circular list, and therefore the answer to “what order are the warps scheduled in?” is likely “effectively randomly, in a circular but dynamic order.”
Other interesting clue, just for brainstorming:
Half-warps of 16 threads seem to be a natural fit for the scheduling size: 16 threads is the memory transaction size, and the manual even mentions obliquely that threads are grouped into bundles of 16. So it’s quite possible that the hardware is capable of scheduling quanta of HALF warps of 16 threads! So why make warps 32? Because of that scheduling queue size… with a quantum of 16 threads per warp, you’d need double the scheduling slots and therefore more scheduling hardware. It’s likely a design decision whether 16 or 32 (or 8 or 64) is most efficient for actual computation. It seems logical that smaller is better (hey, make every thread independent, right?), but that’s actually wrong: you’d have to double, or quadruple, the scheduling hardware. And remember how well that hardware works… there’s NO scheduling overhead, syncthreads is cheap, it’s all automatic. It may be expensive (transistor-wise) to scale that up. Possible of course, but is it worth the tradeoff?
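For what it’s worth, the half-warp grouping is exactly what the compute 1.x coalescing rules are written in terms of. A trivial kernel to see the indices involved (purely illustrative; needs a device with printf support, so newer hardware than G200):

```
// Purely illustrative: the 16-thread half-warp groups the old coalescing
// rules are stated in terms of.  warpSize is 32; half-warps are groups of 16.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void showHalfWarps()
{
    int warp     = threadIdx.x / warpSize;   // which warp in the block
    int lane     = threadIdx.x % warpSize;   // lane within the warp
    int halfWarp = threadIdx.x / 16;         // the 16-thread memory-transaction group
    if (lane == 0 || lane == 16)
        printf("thread %3d -> warp %d, lane %2d, half-warp %d\n",
               threadIdx.x, warp, lane, halfWarp);
}

int main()
{
    showHalfWarps<<<1, 64>>>();              // two warps = four half-warps
    cudaDeviceSynchronize();
    return 0;
}
```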