Launching multiple kernels in the same context vs. from multiple processes

When I use multithreading to launch multiple kernels to the same device – and by default, all the threads of the same process share the same CUcontext – I realized that all the kernels are processed serially. However, when launching the same number of kernels from different processes, the kernels are processed in parallel. Is this because, when there are multiple kernel launches to the same GPU, the GPU processes them serially? Please share any details you know regarding CUcontext and parallel kernel launches.
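To make the setup concrete, here is a minimal, hypothetical sketch of this kind of multithreaded launch pattern (simplified to the runtime API; the assumption here is that every launch targets the legacy default stream of the shared context, which would serialize the kernels):

```
// Hypothetical multithreaded launch pattern (runtime API).
// All threads share the process's primary context; every launch goes to the
// legacy default stream, so the device executes the kernels one after another.
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void work(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = 0.0f;
        for (int k = 0; k < 20000; ++k)      // artificial work to lengthen the kernel
            v += sinf(v + (float)i);
        out[i] = v;
    }
}

int main()
{
    const int n = 1 << 16;
    const int nthreads = 3;
    std::vector<float *> bufs(nthreads);
    for (auto &b : bufs) cudaMalloc(&b, n * sizeof(float));

    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([&, t] {
            // Same context and same (default) stream for every thread:
            // these launches end up serialized on the device.
            work<<<(n + 255) / 256, 256>>>(bufs[t], n);
            cudaDeviceSynchronize();
        });
    for (auto &t : pool) t.join();

    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    for (auto &b : bufs) cudaFree(b);
    return 0;
}
```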

Thanks

Usually that’s done with persistent launch grids that have fewer blocks than there are available multiprocessors.

It may be hard to get another kernel launch to achieve simultaneous execution, though.
There are a couple of threads on these forums that talk about concurrent kernel execution.

https://devtalk.nvidia.com/default/topic/937022/concurrent-kernel-execution/
https://devtalk.nvidia.com/default/topic/1046981/deep-dive-in-concurrent-kernel-launches/
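The usual starting point for concurrent kernel execution is putting each kernel in its own non-default stream and keeping the grids small enough that they can coexist on the device. A rough, untested sketch of that pattern (runtime API, single host thread):

```
// Rough sketch of stream-based concurrent kernel execution.
// Each kernel goes into its own stream and uses a small grid, so the two
// launches are at least eligible to run at the same time.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = 0.0f;
        for (int k = 0; k < 100000; ++k)   // artificial work
            v = v * 1.0000001f + 0.0001f;
        out[i] = v;
    }
}

int main()
{
    const int n = 1 << 14;                 // deliberately small grid
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Different non-default streams: no implicit ordering between the kernels.
    spin<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    spin<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

Whether the two kernels actually overlap depends on how many blocks, registers, and how much shared memory each one needs; a profiler timeline is the only reliable way to check.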

Oh, sorry for not being clear. I am more interested in the relationship between concurrent kernel launches and CUcontext.

What I have observed: launching process “A” from 3 different processes increases throughput by about 3 times and raises volatile GPU utilization (checked with nvidia-smi) from 30% to 90%. However, when I run the same work from 3 different threads of the same process, with all the threads using the same CUcontext object, the utilization stays at 30% and there is no speedup. This suggests that there is some concurrent kernel execution when launching from multiple processes, but not when launching from multiple threads.

So:

  1. Is it because they use the same CUcontext that I don’t see any concurrency?
  2. Or is it because they are threads of the same process?

Kernels launched from separate processes do not run concurrently. They may appear to run concurrently, but they do not.

This can be modified somewhat if MPS is used.

It is true that separate processes will use separate CUDA contexts in the non-MPS case.

The significance of the utilization metric in nvidia-smi is discussed here:

https://stackoverflow.com/questions/40937894/nvidia-smi-volatile-gpu-utilization-explanation/40938696#40938696

Therefore, if you have a process that launches a kernel that occupies about 30% of the GPU execution timeline, it’s entirely possible that running 3 of those processes will cause the utilization reported by nvidia-smi to go to 90%. It’s certainly possible to do something similar in the multithreaded scenario.
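For example, within one process you can give each host thread its own non-blocking stream, so the work issued from the three threads can keep the GPU busy in a similar way to three processes. A rough, hypothetical sketch using the runtime API (whether the kernels actually overlap still depends on each kernel's resource usage):

```
// Rough sketch: 3 host threads, one shared (primary) context, and one
// non-blocking stream per thread, so that within a single process the
// kernels are at least eligible to execute concurrently.
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void work(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = 0.0f;
        for (int k = 0; k < 20000; ++k)    // artificial work
            v += sinf(v + (float)i);
        out[i] = v;
    }
}

void worker(float *buf, int n)
{
    cudaStream_t s;
    // Non-blocking stream: avoids implicit synchronization with the legacy
    // default stream that the other threads might be using.
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    work<<<(n + 255) / 256, 256, 0, s>>>(buf, n);
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
}

int main()
{
    const int n = 1 << 16;
    const int nthreads = 3;
    std::vector<float *> bufs(nthreads);
    for (auto &b : bufs) cudaMalloc(&b, n * sizeof(float));

    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back(worker, bufs[t], n);
    for (auto &t : pool) t.join();

    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    for (auto &b : bufs) cudaFree(b);
    return 0;
}
```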

Thanks Robert_Crovella for the clarification.

So the utilization reported by nvidia-smi is only the percentage of time during which at least one kernel was present on the GPU, and it doesn’t tell us anything about how well that kernel is using the GPU’s resources.

  1. It is weird that I get 30% when I launch process “A”. In the example benchmark you wrote on StackOverflow, you could observe lower utilization when the process was asleep. This makes me think that there is some code that makes process “A” sleep or do computation on the CPU instead of on the GPU. What are the usual causes of such low utilization?
  2. Why would launching 3 kernels from 3 separate processes allow utilization to go up to 90% (and the speed to go up 3 times), but not when launching from 3 threads of the same process? This would really help me figure out how to do this for the multi-threaded scenario.