Multiple processes sharing the same device: MPI+CUDA

Hi all,

I have multiple processes sharing a single device. Each process transfers some data to the GPU and then launches a kernel. I understand that CUDA creates a separate context for each process and that kernels in different contexts are executed serially. So I was wondering whether the same is true for memory transfers as well?

In other words, are memory transfers from multiple processes to a GPU executed serially or concurrently?


PCIe bus transfers in a particular direction are always serialized, even when they originate from the same process. I believe transfers in opposite directions can overlap, if you have satisfied the rules for concurrent transfers.
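
To make the "rules for concurrent transfers" a bit more concrete, here is a minimal single-process sketch. It assumes the usual requirements: pinned (page-locked) host buffers, cudaMemcpyAsync() issued in two different non-default streams, and a device that reports two copy engines. The buffer names and sizes are made up for illustration, and whether the two copies really overlap depends on the hardware and driver.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal error check: print the CUDA error and bail out of main().
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t e_ = (call);                                        \
        if (e_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,          \
                    cudaGetErrorString(e_));                            \
            return 1;                                                   \
        }                                                               \
    } while (0)

int main() {
    const size_t n = 1 << 24;                 // illustrative size
    const size_t bytes = n * sizeof(float);

    // asyncEngineCount == 2 means the device can copy in both
    // directions at once; with 1 the two copies below will serialize.
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    printf("asyncEngineCount = %d\n", prop.asyncEngineCount);

    // Host buffers must be pinned for the copies to be asynchronous.
    float *h_in = nullptr, *h_out = nullptr;
    float *d_in = nullptr, *d_out = nullptr;
    CHECK(cudaMallocHost((void**)&h_in, bytes));
    CHECK(cudaMallocHost((void**)&h_out, bytes));
    CHECK(cudaMalloc((void**)&d_in, bytes));
    CHECK(cudaMalloc((void**)&d_out, bytes));
    for (size_t i = 0; i < n; ++i) h_in[i] = 1.0f;
    CHECK(cudaMemset(d_out, 0, bytes));

    // One stream per direction, so neither copy waits on the other.
    cudaStream_t up, down;
    CHECK(cudaStreamCreate(&up));
    CHECK(cudaStreamCreate(&down));

    CHECK(cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, up));
    CHECK(cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, down));
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaStreamDestroy(up));
    CHECK(cudaStreamDestroy(down));
    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_out));
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return 0;
}
```

A profiler timeline is the easiest way to check whether the two copies actually ran at the same time on your card.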

You may want to take a look at CUDA MPS.

Thanks, I’ll look at MPS.

So then, there is no real point in having multiple processes share the same device using normal CUDA, right? Since the calls will all be executed serially.

It certainly simplifies matters if you only have one process (less space used for contexts, potential for overlapping kernels). You lose memory protection, of course.
PCIe transfers in opposite directions can overlap if you use Tesla or Quadro cards. On GeForce cards you can program around that issue by using cudaMemcpy() for one direction and mapped (aka zero-copy) memory plus a custom kernel for the other. But you'd still need the kernels to be in the same context, i.e. to come from the same process.
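
A rough sketch of that workaround, under some assumptions: the host-to-device direction uses cudaMemcpyAsync() (the async variant, so the call does not block the host) in one stream, while a small copy kernel in another stream writes device data straight into mapped host memory for the device-to-host direction. The kernel name copy_out, the buffer names, and the launch configuration are all hypothetical, and I make no claim about how much overlap you actually get on a given card.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t e_ = (call);                                        \
        if (e_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,          \
                    cudaGetErrorString(e_));                            \
            return 1;                                                   \
        }                                                               \
    } while (0)

// Hypothetical copy kernel: reads device memory and writes to a mapped
// host buffer, so the device-to-host traffic is generated by the kernel
// itself rather than by a copy engine.
__global__ void copy_out(const float* src, float* dst, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride) dst[i] = src[i];
}

int main() {
    const size_t n = 1 << 22;                 // illustrative size
    const size_t bytes = n * sizeof(float);

    // Request mapped host memory support before the context is created.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    float *h_in = nullptr, *h_out = nullptr, *h_out_dev = nullptr;
    float *d_in = nullptr, *d_result = nullptr;
    CHECK(cudaMallocHost((void**)&h_in, bytes));                        // pinned, for upload
    CHECK(cudaHostAlloc((void**)&h_out, bytes, cudaHostAllocMapped));   // zero-copy buffer
    CHECK(cudaHostGetDevicePointer((void**)&h_out_dev, h_out, 0));
    CHECK(cudaMalloc((void**)&d_in, bytes));
    CHECK(cudaMalloc((void**)&d_result, bytes));
    for (size_t i = 0; i < n; ++i) h_in[i] = 2.0f;
    CHECK(cudaMemset(d_result, 0, bytes));

    cudaStream_t up, down;
    CHECK(cudaStreamCreate(&up));
    CHECK(cudaStreamCreate(&down));

    // Host-to-device via the copy engine ...
    CHECK(cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, up));
    // ... and device-to-host via the kernel writing into mapped memory.
    copy_out<<<128, 256, 0, down>>>(d_result, h_out_dev, n);
    CHECK(cudaGetLastError());

    // h_out is only safe to read on the host after synchronization.
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaStreamDestroy(up));
    CHECK(cudaStreamDestroy(down));
    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_result));
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return 0;
}
```

Both the copy and the kernel are issued from the same process here, which is what keeps them in the same context.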