Invoking a kernel from multiple PC processes


I’ve developed an algorithm using CUDA. It works fine when only one process invokes it.

But when multiple processes use it concurrently, performance drops noticeably, even though GPU occupancy is still low.

I noticed that in CUDA 3.0, kernels from different contexts can’t run concurrently. I think that’s why the performance drops. Is that correct?

So how can I prevent the performance drop with multiple processes? Since the algorithm uses only a few blocks, it would be great if different kernels could run concurrently.

PS: why can’t kernels from different contexts run concurrently? Is it a limitation of the hardware design?

Thanks in advance.

There is no solution to your problem at this point. CUDA devices have to go through a context switch (much like a CPU) to shift control from one process to another, and this overhead will impact your overall throughput. This switching overhead has gone down in newer CUDA releases, and I vaguely recall statements that it was better for Fermi devices, though I have not benchmarked this myself.

The best way to improve your overall throughput is to increase the amount of work each kernel does so that the context-switching overhead is not significant. The other option is to combine your multiple processes into one process, and assign your different CUDA kernels to different CUDA streams so that they will run concurrently on Fermi devices.
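A minimal sketch of the single-process option, assuming a device and runtime that support concurrent kernels (the `concurrentKernels` field of `cudaDeviceProp` reports this). The kernel `scale_kernel`, the helper `run_jobs_in_streams`, and the `CUDA_CHECK` macro are hypothetical stand-ins for your own algorithm, not anything from your code. Each job that used to live in its own process gets its own stream, so the small kernels are allowed to overlap, though the hardware does not guarantee that they will.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (line %d)\n",             \
                    cudaGetErrorString(err_), __LINE__);              \
            std::exit(1);                                             \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the real algorithm: scales a small array.
__global__ void scale_kernel(float *data, int n, float factor)
{
    // Grid-stride loop, so a grid with few blocks still covers all n elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        data[i] *= factor;
    }
}

// Runs num_jobs independent small kernels from ONE process (one context),
// each in its own stream. Returns true if every job produced correct output.
bool run_jobs_in_streams(int num_jobs, int n)
{
    std::vector<cudaStream_t> streams(num_jobs);
    std::vector<float *> dev(num_jobs, nullptr);
    std::vector<float> host(n, 1.0f);

    // One stream and one device buffer per job; every buffer starts as all 1.0f.
    for (int j = 0; j < num_jobs; ++j) {
        CUDA_CHECK(cudaStreamCreate(&streams[j]));
        CUDA_CHECK(cudaMalloc(&dev[j], n * sizeof(float)));
        CUDA_CHECK(cudaMemcpy(dev[j], host.data(), n * sizeof(float),
                              cudaMemcpyHostToDevice));
    }

    // Launch every job back to back. Each launch uses only 2 blocks, so
    // kernels in different streams may run at the same time.
    for (int j = 0; j < num_jobs; ++j) {
        scale_kernel<<<2, 128, 0, streams[j]>>>(dev[j], n, float(j + 2));
        CUDA_CHECK(cudaGetLastError());
    }
    CUDA_CHECK(cudaDeviceSynchronize());

    // Copy back and verify: job j should hold the value j + 2 everywhere.
    bool ok = true;
    for (int j = 0; j < num_jobs; ++j) {
        CUDA_CHECK(cudaMemcpy(host.data(), dev[j], n * sizeof(float),
                              cudaMemcpyDeviceToHost));
        for (int i = 0; i < n; ++i) {
            if (host[i] != float(j + 2)) ok = false;
        }
        CUDA_CHECK(cudaFree(dev[j]));
        CUDA_CHECK(cudaStreamDestroy(streams[j]));
    }
    return ok;
}
```

Calling `run_jobs_in_streams(4, 1024)` from `main` launches four independent jobs from a single context. The grid-stride loop in the kernel also covers the first suggestion: you can hand a larger `n` to one launch instead of issuing many tiny launches.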

I have no idea what the technical reasons are for CUDA devices not being able to execute kernels from different contexts at once. Possibly it is related to an inability to manage multiple virtual memory spaces (each context gets its own in CUDA) simultaneously in an efficient manner. That could definitely require newer hardware to improve, but this is all speculation.

However, once this limitation is fixed, the graphics driver will be able to coexist with CUDA kernels, and we will finally see the end of the watchdog and gain the ability to run cuda-gdb on the same device that is also rendering the GUI.