I am looking into the performance interference problem among processes co-running on a single GPU, so I need to learn more about GPU context switching. This question targets the following scenario: two processes (e.g., two TensorFlow object detection applications) running on a single GPU (NVIDIA Tesla P100, CUDA version 10.1, driver version 418.40.04, on Ubuntu 16). I am NOT using the NVIDIA Multi-Process Service (MPS): https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
My question is about how the cudaContexts of different processes are scheduled to execute on a single GPU (via time slicing, as far as I know). Specifically:
(1) What is the scheduling policy?
I have read some papers and documents that say the scheduling policy is FIFO: CUDA and the driver maintain a single queue holding all pending kernel execution requests, and whenever the kernel at the front of the queue belongs to a different cudaContext than the one currently running, a GPU context switch is invoked. Is this right?
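To make the policy I am describing concrete, here is a toy Python model of it. It is only a sketch of my understanding, not NVIDIA's actual scheduler: the function name, the context labels "A"/"B", and the kernel names are all made up for illustration. It replays a single FIFO queue of kernel requests and counts a context switch whenever the request at the front belongs to a different context than the one that ran last.

```python
from collections import deque


def count_context_switches(requests):
    """Replay a FIFO queue of (context_id, kernel_name) requests.

    Toy model only: a switch is counted whenever the request at the
    front of the queue belongs to a different context than the one
    that ran last. The first request does not count as a switch.
    """
    queue = deque(requests)
    current = None   # context that ran most recently
    switches = 0
    order = []       # execution order, identical to arrival order under FIFO
    while queue:
        ctx, kernel = queue.popleft()
        if current is not None and ctx != current:
            switches += 1
        current = ctx
        order.append((ctx, kernel))
    return switches, order


# Hypothetical arrival patterns for two processes with contexts "A" and "B".
interleaved = [("A", "k0"), ("B", "k0"), ("A", "k1"), ("B", "k1")]
batched = [("A", "k0"), ("A", "k1"), ("B", "k0"), ("B", "k1")]

print(count_context_switches(interleaved)[0])  # 3 switches
print(count_context_switches(batched)[0])      # 1 switch
```

If the real policy is strictly FIFO, then under this model the number of context switches depends entirely on how the two processes' kernel launches interleave, which is exactly the interference I am trying to understand.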
(2) What are the “scheduling resources” (mentioned in section 2.1.3 of this NVIDIA document: https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf) that need to be swapped off and onto the GPU during a context switch? In other words, where does the GPU context switch overhead come from?
(3) Will the GPU memory allocated to a process survive a context switch (i.e., it won’t be swapped off the GPU)? My guess is that the GPU memory allocated to a process always stays resident on the GPU for as long as the process is running.
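One way I can think of to observe this from the outside is to poll nvidia-smi for the per-process memory figures while both processes are running. The Python sketch below does that; the helper names are my own, and it assumes nvidia-smi is on the PATH. Note that it only shows what the driver reports as allocated per process, so it cannot by itself prove that the pages stay physically resident across a context switch.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' rows (csv,noheader,nounits) into {pid: MiB}.

    Rows that are not two integer fields (for example when the driver
    reports '[N/A]' for memory) are skipped.
    """
    usage = {}
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 2:
            continue
        if not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        usage[int(fields[0])] = int(fields[1])
    return usage


def query_compute_apps():
    """Ask nvidia-smi for the GPU memory used by each compute process."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,used_memory",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compute_apps(result.stdout)
```

Calling query_compute_apps() repeatedly (say, once a second) while the two applications run should show whether either process's reported usage ever drops while the other one is executing.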
I know that the context switch mechanism for multiple processes on a GPU is complicated. I just want to understand the principles, e.g., whether the scheduling policy follows FIFO rules or some kind of “fairness” rules.