Concurrent execution of CUDA kernels from different contexts

https://docs.nvidia.com/deploy/mps/index.html#topic_4_1 says

“GPU’s with Hyper-Q have a concurrent scheduler to schedule work from work queues belonging to a single CUDA context.”
But when I profiled multiple processes with the Visual Profiler, kernels from different contexts executed concurrently even though I was not using the Multi-Process Service (MPS).

Also, the MPS documentation explains that it allows CUDA kernels to execute simultaneously to achieve higher utilization. However, when I use the Multi-Process Service (MPS), the Visual Profiler shows that kernels overlap only slightly; most of their execution time does not overlap. So I don't understand why MPS gives us a better chance to utilize GPU resources.
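
For context, the experiment can be reproduced with a minimal test program like the sketch below: each process launches a single long-running spin kernel, and running several instances at once under the profiler, with and without MPS, shows how much the kernels from different contexts overlap on the timeline. The kernel and spin length here are placeholder assumptions, not the code that was actually profiled.

```cpp
// Minimal sketch of a multi-process concurrency test (placeholder code,
// not the program profiled above). Run several instances of this binary
// at the same time under the profiler, with and without MPS.
#include <cstdio>
#include <cuda_runtime.h>

// Spin for roughly `cycles` GPU clock cycles so the kernel stays resident
// long enough for overlap (or the lack of it) to be visible in the timeline.
__global__ void spin_kernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { /* busy-wait */ }
}

int main()
{
    // A deliberately small grid leaves most SMs idle, so concurrent
    // execution with kernels from other processes is at least possible.
    spin_kernel<<<1, 64>>>(1LL << 30);
    cudaError_t err = cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(err));
    return 0;
}
```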

CC >= 3.5 supports thread block level preemption.
CC >= 6.0 supports instruction level preemption.
(A quick runtime check for this is sketched below.)
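
The sketch simply reads the compute capability with cudaGetDeviceProperties and maps it to the two thresholds listed above; the mapping itself is taken from those two lines, nothing more.

```cpp
// Sketch: query the compute capability of device 0 and report which
// preemption granularity from the list above applies.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // device 0
    int cc = prop.major * 10 + prop.minor;
    printf("%s: CC %d.%d\n", prop.name, prop.major, prop.minor);
    if (cc >= 60)
        printf("instruction level preemption\n");
    else if (cc >= 35)
        printf("thread block level preemption\n");
    else
        printf("no preemption support listed above\n");
    return 0;
}
```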

NVVP/CUPTI captures the start and end timestamps of grid launches and memory copies. However, the tool does not currently capture GPU context switch events. In the future the tool will hopefully show context switch events and correctly segment the launches.
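
As a rough illustration of where those timestamps come from, the sketch below uses the CUPTI Activity API to print kernel start/end times. It is simplified and makes assumptions (buffer size, the CUpti_ActivityKernel4 record version, no error checking), so treat it as an outline rather than a complete tool; note that only launch start/end are recorded, so any context switches inside that interval stay invisible, as described above.

```cpp
// Rough sketch of collecting kernel start/end timestamps via the CUPTI
// Activity API. Simplified: no error checking, and the record struct
// version (CUpti_ActivityKernel4) depends on the CUPTI release.
#include <cstdio>
#include <cstdlib>
#include <cstdint>
#include <cuda_runtime.h>
#include <cupti.h>

static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                     size_t *maxNumRecords)
{
    *size = 16 * 1024;                  // small activity buffer for the sketch
    *buffer = (uint8_t *)malloc(*size);
    *maxNumRecords = 0;                 // no limit on records per buffer
}

static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size, size_t validSize)
{
    CUpti_Activity *record = nullptr;
    while (cuptiActivityGetNextRecord(buffer, validSize, &record) == CUPTI_SUCCESS) {
        if (record->kind == CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL) {
            // Only the launch start/end timestamps are recorded here;
            // context switches inside this interval are not visible.
            const CUpti_ActivityKernel4 *k = (const CUpti_ActivityKernel4 *)record;
            printf("%s: start=%llu ns end=%llu ns\n", k->name,
                   (unsigned long long)k->start, (unsigned long long)k->end);
        }
    }
    free(buffer);
}

int main()
{
    cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);

    // ... launch kernels / memory copies here ...

    cudaDeviceSynchronize();
    cuptiActivityFlushAll(0);           // deliver buffered records to bufferCompleted
    return 0;
}
```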

In a normal trace it is possible to see slight overlap even within a single stream. The work is not actually overlapping; this is an artifact of the method used to collect the end timestamp of a kernel. This overlap should rarely be > 500 ns.

The Nsight VSE CUDA trace and Nsight Systems (mobile) tools can show when the GPU is active for a specific context, which helps determine false concurrency.