I’ve been working with CUDA’s Green Context feature for a while, and a few questions have come up that I haven’t been able to resolve through the available documentation.
Most of the examples I’ve seen use Green Contexts within a single process — typically creating multiple Green Contexts and launching kernels on them concurrently in that same process. [link here]
However, I wonder if this technology is also compatible across different processes. Specifically:
Can I launch two separate processes in parallel, each creating and using its own Green Context independently?
Furthermore, is it possible to run processes using Green Context in parallel with others that launch traditional CUDA kernels (i.e., not using Green Context)?
I don’t think there should be any issue with having two separate processes both using Green Context, and/or one process using Green Context and one not. CUDA GPUs have basic process isolation as well as context switching between processes. To a first-order approximation, what process A does should not materially impact what process B can do, with a couple of obvious exceptions: the GPU memory one process consumes is unavailable to the other, and each process’s activity can affect the other’s performance/throughput.
The issue I’m encountering when trying to launch processes in parallel — one using Green Context and the other not — is that they don’t seem to execute concurrently on the GPU. Theoretically, this shouldn’t happen, right? Or is this what you’re referring to as the effect on performance/throughput?
I’m attaching an image from an NVIDIA Nsight Systems report. In this case, both processes are launching the same kernel, with the only difference being that one uses Green Context (with 90% of the GPU resources assigned) and the other doesn’t.
What I don’t understand is why — for every full execution of the kernel in the Green Context — the kernel without reserved resources gets scheduled and executed multiple times in between, even though they’re running the exact same workload.
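For reference, here is roughly how the green-context process sets things up. This is a simplified sketch of the CUDA driver API calls (CUDA 12.4+), with error handling omitted; the 90% figure matches my test, but kernel names and the launch step are placeholders:

```c
#include <cuda.h>

int main(void)
{
    CUdevice dev;
    cuInit(0);
    cuDeviceGet(&dev, 0);

    /* Query the device's full SM resource. */
    CUdevResource all_sms;
    cuDeviceGetDevResource(dev, &all_sms, CU_DEV_RESOURCE_TYPE_SM);

    /* Request a partition holding ~90% of the SMs. Note the split is
       rounded to the hardware's allowed SM-group granularity, so the
       actual count can differ from the request. */
    unsigned int min_sms = (unsigned int)(all_sms.sm.smCount * 0.9);
    CUdevResource partition, remainder;
    unsigned int nb_groups = 1;
    cuDevSmResourceSplitByCount(&partition, &nb_groups, &all_sms,
                                &remainder, 0, min_sms);

    /* Turn the partition into a green context, and create a stream
       bound to that context. */
    CUdevResourceDesc desc;
    cuDevResourceGenerateDesc(&desc, &partition, 1);
    CUgreenCtx gctx;
    cuGreenCtxCreate(&gctx, desc, dev, CU_GREEN_CTX_DEFAULT_STREAM);
    CUstream stream;
    cuGreenCtxStreamCreate(&stream, gctx, CU_STREAM_NON_BLOCKING, 0);

    /* ... launch heavyKernel into `stream` here ... */

    cuStreamDestroy(stream);
    cuGreenCtxDestroy(gctx);
    return 0;
}
```

The non-green-context process is identical except that it launches the same kernel into an ordinary stream in the default context.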
Generally, separate processes don’t execute concurrently on the GPU. That is true whether we are talking about green contexts or not.
When multiple processes are launched on a single GPU without MPS, the GPU will context-switch between the processes. That means at any given instant, when a kernel belonging to one process is executing, kernels belonging to other processes cannot/will not be executing.
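If you actually want kernels from two processes to overlap on the GPU, MPS is the usual route on platforms that support it. A rough sketch, assuming a Linux system with the MPS binaries on the PATH and `./proc_green` / `./proc_plain` standing in for your two programs (note that MPS availability on Jetson depends on the platform and JetPack version):

```shell
# Start the MPS control daemon (requires appropriate permissions).
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d

# Launch both processes as MPS clients; their kernels can then
# overlap on the GPU instead of being time-sliced between contexts.
./proc_green & ./proc_plain &
wait

# Shut the daemon down.
echo quit | nvidia-cuda-mps-control
```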
There is a lot of unpublished detail here. The exact context-switching behavior can vary in terms of when the switches happen, but the statement above still holds: at any given instant, only kernel(s) from one process can execute. Inter-process context switching on modern GPUs typically follows a time-sliced pattern in my experience, though observations may vary, and I don’t know for sure whether there is any context-switching nuance in the Jetson case. But given your relatively long kernel durations here (tens of seconds, it seems), it certainly looks like inter-process context switching is happening on a time-slicing basis, which gives the “illusion” that both kernels are executing “simultaneously”.
Looking at your profiler output, it seems evident that the heavyKernel duration in the green-context case (perhaps around 22 seconds) is substantially longer than in the non-green-context case (perhaps around 11 seconds). That should be the proximal explanation for the difference in throughput. Perhaps you haven’t actually assigned 90% of resources as intended, or perhaps the green context is slowing the kernel down for some other reason. If I were studying this, I would first measure the kernel duration for each process independently (i.e., running only one process at a time). If the durations are still in a 2:1 ratio, then there is no reason to assume multi-process behavior has anything to do with this.
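For what it’s worth, one way to do that per-process comparison (assuming Nsight Systems is installed, with `./proc_green` / `./proc_plain` again standing in for your two programs):

```shell
# Profile each process on its own, so no inter-process
# context switching can affect the measurement.
nsys profile -o green --trace=cuda ./proc_green
nsys profile -o plain --trace=cuda ./proc_plain

# Summarize GPU kernel durations from each report and compare
# the average/total time for heavyKernel between the two runs.
nsys stats --report cuda_gpu_kern_sum green.nsys-rep
nsys stats --report cuda_gpu_kern_sum plain.nsys-rep
```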