I have multiple processes sharing a single device. Each process transfers some data to the GPU and then launches a kernel. I understand that CUDA creates a separate context for each process and that kernels in different contexts are executed serially. Is the same true for memory transfers?
In other words, are memory transfers from multiple processes to a GPU executed serially or concurrently?
PCIe bus transfers in a particular direction are always serialized, even when they originate from the same process. I believe transfers in opposite directions can overlap, provided you have satisfied the rules for concurrent transfers (asynchronous copies from page-locked host memory, issued in different non-default streams).
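To make the "rules for concurrent transfers" concrete, here is a minimal sketch of one process issuing an upload and a download in two streams. It assumes pinned host buffers and non-default streams; the helper name `run_bidirectional_copy`, the `CHECK` macro, and the buffer names are made up for illustration. Whether the two copies actually overlap depends on the device (the `asyncEngineCount` property gives a hint), so treat this as the setup that permits overlap, not a guarantee of it.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Issues one host-to-device and one device-to-host copy in separate streams.
// Returns the number of mismatched elements (0 means both copies were correct).
int run_bidirectional_copy(size_t n) {
    assert(n > 0);
    const size_t bytes = n * sizeof(float);

    float *h_in = nullptr, *h_out = nullptr;   // pinned host buffers
    float *d_up = nullptr, *d_down = nullptr;  // device buffers
    CHECK(cudaMallocHost(&h_in, bytes));       // page-locked: needed for async copies
    CHECK(cudaMallocHost(&h_out, bytes));
    CHECK(cudaMalloc(&d_up, bytes));
    CHECK(cudaMalloc(&d_down, bytes));

    for (size_t i = 0; i < n; ++i) { h_in[i] = (float)i; h_out[i] = -1.0f; }
    CHECK(cudaMemset(d_down, 0, bytes));       // all-zero bytes read back as 0.0f

    cudaStream_t up, down;
    CHECK(cudaStreamCreate(&up));
    CHECK(cudaStreamCreate(&down));

    // Opposite directions, different streams: eligible to overlap on capable hardware.
    CHECK(cudaMemcpyAsync(d_up, h_in, bytes, cudaMemcpyHostToDevice, up));
    CHECK(cudaMemcpyAsync(h_out, d_down, bytes, cudaMemcpyDeviceToHost, down));
    CHECK(cudaDeviceSynchronize());

    int bad = 0;
    for (size_t i = 0; i < n; ++i) bad += (h_out[i] != 0.0f);  // check the download

    // Read the uploaded data back (synchronously) to check the upload.
    CHECK(cudaMemcpy(h_out, d_up, bytes, cudaMemcpyDeviceToHost));
    for (size_t i = 0; i < n; ++i) bad += (h_out[i] != h_in[i]);

    CHECK(cudaStreamDestroy(up));
    CHECK(cudaStreamDestroy(down));
    CHECK(cudaFree(d_up));
    CHECK(cudaFree(d_down));
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return bad;
}

int main() {
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    printf("asyncEngineCount = %d\n", prop.asyncEngineCount);

    int bad = run_bidirectional_copy(1 << 20);
    printf("mismatches = %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

A profiler timeline is the practical way to see whether the two copies ran side by side on a given card. Two copies in the same direction would still run one after the other, as described above.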
It certainly simplifies matters if you only have one process (less space used for contexts, potential for overlapping kernels). You lose memory protection, of course.
PCIe transfers in opposite directions can overlap if you use Tesla or Quadro cards. On GeForce cards you can program around that limitation by using cudaMemcpy() for one direction and mapped (aka zero-copy) memory with a custom kernel for the other. But you'd still need both to be in the same context, i.e. come from the same process.
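A rough sketch of that workaround follows, under some assumptions: the upload uses the async variant `cudaMemcpyAsync()` in its own stream so that it can run alongside the kernel, and the download is done by a small copy kernel that writes into mapped host memory. The kernel `write_back`, the helper `run_zero_copy_workaround`, and the `CHECK` macro are illustrative names, not part of any library.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Custom "download" kernel: each store to mapped_dst travels over PCIe to host memory.
__global__ void write_back(const float *d_src, float *mapped_dst, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) mapped_dst[i] = d_src[i];
}

// Uploads with a copy while a kernel downloads through zero-copy memory.
// Returns the number of mismatched elements (0 means success).
int run_zero_copy_workaround(size_t n) {
    assert(n > 0);
    const size_t bytes = n * sizeof(float);

    // Older setups need this flag set before the context exists. If a context
    // is already active the call may fail, so clear that error and carry on.
    if (cudaSetDeviceFlags(cudaDeviceMapHost) != cudaSuccess) cudaGetLastError();

    float *h_in = nullptr, *h_out = nullptr, *d_out_alias = nullptr;
    float *d_up = nullptr, *d_result = nullptr;
    CHECK(cudaMallocHost(&h_in, bytes));                         // pinned upload source
    CHECK(cudaHostAlloc(&h_out, bytes, cudaHostAllocMapped));    // mapped download target
    CHECK(cudaHostGetDevicePointer(&d_out_alias, h_out, 0));     // device view of h_out
    CHECK(cudaMalloc(&d_up, bytes));
    CHECK(cudaMalloc(&d_result, bytes));

    for (size_t i = 0; i < n; ++i) { h_in[i] = (float)i; h_out[i] = -1.0f; }
    CHECK(cudaMemcpy(d_result, h_in, bytes, cudaMemcpyHostToDevice));  // seed "results"

    cudaStream_t up, down;
    CHECK(cudaStreamCreate(&up));
    CHECK(cudaStreamCreate(&down));

    // Host-to-device goes through the copy engine ...
    CHECK(cudaMemcpyAsync(d_up, h_in, bytes, cudaMemcpyHostToDevice, up));
    // ... while device-to-host is performed by a kernel writing to mapped memory.
    const unsigned threads = 256;
    const unsigned blocks = (unsigned)((n + threads - 1) / threads);
    write_back<<<blocks, threads, 0, down>>>(d_result, d_out_alias, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());   // host may read h_out only after this

    int bad = 0;
    for (size_t i = 0; i < n; ++i) bad += (h_out[i] != h_in[i]);

    CHECK(cudaStreamDestroy(up));
    CHECK(cudaStreamDestroy(down));
    CHECK(cudaFree(d_up));
    CHECK(cudaFree(d_result));
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return bad;
}

int main() {
    int bad = run_zero_copy_workaround(1 << 20);
    printf("mismatches = %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Both streams here belong to one process and therefore one context, which is the precondition mentioned above. The sketch only checks that the data arrives correctly; it does not measure whether the two directions actually ran at the same time.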