I am running an application that uses the GPU, and I have noticed that once the GPU has no work left to do, the CUDA context still stays resident on the device for about 1 ms before being switched out.
I observed this in Nsight Systems, as shown below:
One can see that, after the GPU has nothing left to run, execution moves from the GPU to the DLA. However, the GPU still keeps the CUDA context on the device for roughly 800 µs afterwards. (My program uses both the GPU and the DLA to run a neural network.)
So what exactly is the GPU’s context switch strategy?