I have several questions about how a CUDA device is shared amongst competing processes. I haven’t been able to find much documentation on how this is actually done.
Does anyone have any information (or educated guesses) on how CUDA executes/schedules kernels from different processes (or contexts)? Basically, how is the CUDA device shared amongst the different processes? I understand that a CUDA device can be placed in exclusive compute mode, but I’m more interested in how it behaves when it is actually shared.
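For context, the compute mode I’m referring to can be inspected from the runtime API. A minimal sketch (assumes the CUDA toolkit is installed and device 0 exists; this only reports the mode, it doesn’t change it — changing it is done with `nvidia-smi`):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    // prop.computeMode distinguishes the sharing policies:
    //   cudaComputeModeDefault    - multiple host processes may share the device
    //   cudaComputeModeExclusive  - only one context at a time
    //   cudaComputeModeProhibited - no contexts may be created
    switch (prop.computeMode) {
        case cudaComputeModeDefault:    printf("Default (shared) mode\n"); break;
        case cudaComputeModeExclusive:  printf("Exclusive mode\n");        break;
        case cudaComputeModeProhibited: printf("Prohibited mode\n");       break;
        default:                        printf("Other mode (%d)\n", prop.computeMode);
    }
    return 0;
}
```

My question is specifically about what happens in the default (shared) mode above.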
Are kernels queued up on the driver/host side or on the device itself? Is there any ordering to the queue (regardless of host vs. device side), or is it simply FIFO? If there is an ordering, is it based on launch parameters (e.g., grid/block size) and/or how long a kernel has been waiting to execute? And what about fairness: could a process submitting many kernels overwhelm the others?
Finally, with Fermi’s concurrent kernel execution (for kernels from the same context), any speculation on what would prevent starvation of other tasks? I’m thinking of a case where a context keeps its “foot in the door” by continually feeding in more kernels.
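To make the scenario concrete, here is roughly what I mean by a context keeping its “foot in the door”: one process launching work into several streams, which Fermi-class hardware may overlap on the device. A minimal sketch (the kernel name `busy` and its loop count are just placeholders I made up; assumes a Fermi-or-later device):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel that burns some time so overlap would be observable.
__global__ void busy(float *out, int iters) {
    float v = (float)threadIdx.x;
    for (int i = 0; i < iters; ++i)
        v = v * 1.0000001f + 0.5f;
    out[threadIdx.x] = v;
}

int main() {
    const int kStreams = 4;
    cudaStream_t streams[kStreams];
    float *buf[kStreams];

    for (int i = 0; i < kStreams; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&buf[i], 256 * sizeof(float));
        // Kernels launched into distinct streams have no implied ordering,
        // so they are candidates for concurrent execution on the device.
        busy<<<1, 256, 0, streams[i]>>>(buf[i], 1 << 20);
    }

    cudaDeviceSynchronize();  // wait for all streams to drain

    for (int i = 0; i < kStreams; ++i) {
        cudaFree(buf[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```

If a context keeps refilling its streams like this, I’m curious whether the hardware or driver ever forces it to yield to kernels queued by a different process.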