Don’t mix preemption and CUDA stream priority together.
The programmer has no direct control over preemption. The programmer has direct control over stream priority. Furthermore, stream priority activity does not imply that any preemption is occurring.
The documented forms of preemption take place in:
- Certain debug situations https://docs.nvidia.com/cuda/cuda-gdb/index.html#single-gpu-debugging-with-desktop-manager-running
- Servicing of certain CDP (CUDA Dynamic Parallelism) patterns https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#sm-id-and-warp-id
Preemption may occur in other settings (e.g. in time-sliced servicing of multiple independent clients, i.e. multiple independent processes using a single GPU without MPS) but these are basically undocumented.
Preemption, the way I am using it here (i.e. in the named, documented uses), refers to the idea that a running threadblock may be halted and moved off an SM before it retires.
Stream priority does not require preemption in order to deliver its stated function, and as far as I know there is no documentation that says that stream priority mechanism will use preemption in any setting.
(Yes, as noted below, the documentation now states that preemption may be used. "may" is not the same as "will".)
Stream priority says that the GPU block scheduler, when depositing blocks to be run on various SMs, will choose blocks from higher-priority streams before it chooses blocks from lower-priority streams. The stream priority mechanism makes no claim that I am aware of that a block already deposited on an SM will be preempted (i.e. removed from that SM) to make room for another block.
The basic CUDA (threadblock) execution model is that a threadblock, once deposited on an SM, will remain on that SM until it completes execution and retires.
Don’t mix preemption with stream priority. There is no valid or documented reason to do so. A kernel’s priority is decided according to the priority of the stream it is launched into. There is no other mechanism.
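To make that concrete, here is a minimal host-side sketch of the only mechanism there is: create a stream with a chosen priority and launch into it (the kernel name is a hypothetical placeholder; error checking is omitted for brevity).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { }  // hypothetical placeholder kernel

int main() {
    // The valid priority range is device-dependent; numerically lower
    // values mean higher priority.
    int least, greatest;
    cudaDeviceGetStreamPriorityRange(&least, &greatest);
    printf("priority range: least=%d greatest=%d\n", least, greatest);

    cudaStream_t highPriStream;
    cudaStreamCreateWithPriority(&highPriStream, cudaStreamNonBlocking,
                                 greatest);

    // The kernel's priority is determined solely by the stream it is
    // launched into; there is no per-kernel priority knob.
    myKernel<<<1, 1, 0, highPriStream>>>();

    cudaDeviceSynchronize();
    cudaStreamDestroy(highPriStream);
    return 0;
}
```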
I don’t know what that means, it sounds strange
yes
Don’t mix preemption with stream priority. The CUDA runtime may preempt running kernels for certain CDP needs and in certain debug scenarios. Other uses of preemption by the CUDA runtime are undocumented AFAIK.
Don’t mix preemption with stream priority. High priority kernels will not necessarily preempt low priority kernels. Blocks from high priority kernels receive scheduling priority over blocks from low-priority kernels, but this only applies to blocks which have not yet been scheduled by the GPU block scheduler. Although the CUDA runtime may preempt low priority kernel blocks for high priority kernel blocks, there are no stated conditions under which such behavior is guaranteed, AFAIK.
The observation you describe is therefore certainly possible based on the kernel launch order, and what other intervening activity there may be. If a kernel is launched into a lower-priority stream first, and sometime later a kernel is launched into a higher priority stream, blocks from the higher priority kernel will not begin to execute until the GPU block scheduler finds available space to deposit them on the SM(s). CUDA provides no guarantee of preemption or any other mechanism in such a case to guarantee that blocks from the higher priority kernel will immediately execute upon launch.
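The launch-order scenario described above can be sketched as follows (kernel names and grid sizes are hypothetical; the low-priority grid is deliberately large so its blocks can occupy all SMs before the high-priority launch arrives).

```cuda
#include <cuda_runtime.h>

__global__ void longRunningKernel() { /* ... */ }  // hypothetical
__global__ void urgentKernel()      { /* ... */ }  // hypothetical

int main() {
    int least, greatest;  // numerically lower = higher priority
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    cudaStream_t lowS, highS;
    cudaStreamCreateWithPriority(&lowS,  cudaStreamNonBlocking, least);
    cudaStreamCreateWithPriority(&highS, cudaStreamNonBlocking, greatest);

    // Launched first: its blocks may fill every SM before the
    // high-priority kernel is even launched.
    longRunningKernel<<<4096, 256, 0, lowS>>>();

    // Launched later into the higher-priority stream: its blocks are
    // chosen by the block scheduler ahead of any not-yet-scheduled
    // low-priority blocks, but blocks already resident on an SM are not
    // guaranteed to be preempted to make room for them.
    urgentKernel<<<64, 256, 0, highS>>>();

    cudaDeviceSynchronize();
    cudaStreamDestroy(lowS);
    cudaStreamDestroy(highS);
    return 0;
}
```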
Your question is unclear, but my response is the same for either interpretation:
- Do Maxwell and Pascal architectures behave differently from each other? That is not documented in any way relating to stream priority that I am aware of.
- Do Maxwell and Pascal architectures behave differently from other architectures? That is not documented in any way relating to stream priority that I am aware of.
Notice the principal documentation of CUDA stream priority in the CUDA programming guide:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-priorities
The only runtime functional/behavioral description given is as follows:
“At runtime, as blocks in low-priority schemes finish, waiting blocks in higher-priority streams are scheduled in their place.”
This is a terse description of the functional behavior I have given. Note that the word “schemes” here is a typo; it should be “streams”.