CUDA inter-process data transfer and cuMemcpyPeer seg fault

Hello, all!

I am working on inter-process CUDA buffer transfer: the goal is to let inter-process communication benefit from some of the new features of GPUDirect 2.0, which can eliminate data transfers through host-side main memory. In general, my approach is to create a shared CUDA context for the two processes. I can show that the shared CUDA context actually carries the information the two processes need for direct data transfer, but I ran into a segmentation fault in the cuMemcpyPeer() call. I hope I can get some help on this forum or from NVIDIA.

All of the nice features of GPUDirect 2.0 today actually operate within a single process. If two CUDA compute processes need to transfer data between them, we currently have to copy the data back to host-side main memory and do generic IPC there. What I am trying to find out is whether a CUDA direct peer copy can be done in this situation.

Since there is a virtual memory layer in CUDA, I believe that a CUDA context actually contains the information mapping a virtual address to a GPU device physical address.

So I create a CUDA context (the shared context) and a device buffer (the shared buffer) in one process, and then fork() it. At that point the address space of the parent process is the same as that of the child process, and it includes the shared context. After that, each process creates a new context for its normal CUDA computation, so it won't disturb the shared context. When the child process wants to send a device memory buffer to the parent, it pushes the shared context (making it the current context) and calls cuMemcpyPeer() to copy the data into the shared buffer in the shared context. Symmetrically, the parent copies the data back out of the shared buffer. Here cuMemcpyPeer() generates a segmentation fault in the child process, but it works fine in the parent process.
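In outline, the sequence looks roughly like the sketch below. This is not the attached file: the names (shared_ctx, shared_buf, local_ctx, local_buf, BUF_SIZE) are just illustrative, and error checking plus the handshake that makes the parent wait for the child are omitted.

#include <cuda.h>
#include <unistd.h>

#define BUF_SIZE (1 << 20)

int main()
{
    CUdevice dev;
    CUcontext shared_ctx;      /* context both processes will inherit */
    CUdeviceptr shared_buf;    /* shared device buffer */

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&shared_ctx, 0, dev);
    cuMemAlloc(&shared_buf, BUF_SIZE);
    cuCtxPopCurrent(NULL);     /* leave the shared context floating */

    pid_t pid = fork();        /* child inherits the address space,
                                  including the shared context handle */

    /* Each process creates its own context for normal CUDA work. */
    CUcontext local_ctx;
    CUdeviceptr local_buf;
    cuCtxCreate(&local_ctx, 0, dev);
    cuMemAlloc(&local_buf, BUF_SIZE);

    if (pid == 0) {
        /* Child: push the shared context so it becomes current, then
           copy the local buffer into the shared buffer.  This is the
           cuMemcpyPeer() call that segfaults for me. */
        cuCtxPushCurrent(shared_ctx);
        cuMemcpyPeer(shared_buf, shared_ctx, local_buf, local_ctx, BUF_SIZE);
        cuCtxPopCurrent(NULL);
    } else {
        /* Parent: copy the data back out of the shared buffer.
           This direction works for me. */
        cuCtxPushCurrent(shared_ctx);
        cuMemcpyPeer(local_buf, local_ctx, shared_buf, shared_ctx, BUF_SIZE);
        cuCtxPopCurrent(NULL);
    }
    return 0;
}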

Since cuMemcpyPeer() is a new function, I am not sure whether this corner case is handled by the CUDA driver. But I can show that the context information mapping the shared buffer address (a virtual memory address) to its physical address is preserved correctly in the child process after fork(): a simple cuMemcpyHtoD() (host-to-device memory copy) into the shared buffer in the child process can be read back and checked in the parent process. This means that after fork() the shared context already carries the information cuMemcpyPeer() would need to work in the child process.
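Continuing the sketch above (same illustrative names, no error checking, and the parent/child handshake elided), the verification experiment looks roughly like this:

if (pid == 0) {
    /* Child: write a known pattern into the shared buffer through the
       shared context.  This HtoD copy succeeds after fork(). */
    char *src = (char *) malloc(BUF_SIZE);
    memset(src, 0xAB, BUF_SIZE);
    cuCtxPushCurrent(shared_ctx);
    cuMemcpyHtoD(shared_buf, src, BUF_SIZE);
    cuCtxPopCurrent(NULL);
} else {
    /* Parent: after waiting for the child, read the shared buffer back
       and check that it contains the pattern the child wrote.  It does,
       which tells me the virtual-to-physical mapping survived fork(). */
    char *dst = (char *) malloc(BUF_SIZE);
    cuCtxPushCurrent(shared_ctx);
    cuMemcpyDtoH(dst, shared_buf, BUF_SIZE);
    cuCtxPopCurrent(NULL);
}

(This fragment also needs <stdlib.h> and <string.h> for malloc() and memset().)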

My environment is CentOS 5.5, Linux kernel 2.6.36.4, CUDA driver 4.0_linux_64_270.41.19, and a Tesla M2070 card.

Can anyone on this forum or from NVIDIA take a look at my case and perhaps give me some help? Thanks a lot!

Hi all,
I have attached a simple piece of code to this post to show my problem. There is a switch in the code to either 1) reproduce the seg fault, or 2) show that cuMemcpyHtoD() works with the shared context.

Thanks for taking a look!

PS. I use “-arch=sm_20 -lcuda” when compiling.

simplePeerTest.cu (9.25 KB)
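For anyone trying to reproduce this, the full compile line is simply the following (the output name is just my own choice):

nvcc -arch=sm_20 -lcuda simplePeerTest.cu -o simplePeerTest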

No part of CUDA works between processes, so this is not expected to work at all.

Hi tmurray,

Thanks for the reply! I see that the official CUDA documentation does not suggest this kind of cross-process usage; that is part of the reason I want to find out whether it is possible. You see, in the case of using

cuMemcpyHtoD(shared_dev_buf, local_host_buf, size)

in the fork()'ed child process, the cross-process usage actually works. This makes me really want to know whether I can do a similar thing in the other case, with

cuMemcpyPeer(shared_dev_buf, shared_context, local_dev_buf, local_context, size)
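(For reference, the CUDA 4.0 driver API declares this entry point as

CUresult cuMemcpyPeer(CUdeviceptr dstDevice, CUcontext dstContext,
                      CUdeviceptr srcDevice, CUcontext srcContext,
                      size_t ByteCount);

so the call above names the shared context as the destination and my per-process context as the source.)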

Do you see no possibility at all, even given that cuMemcpyHtoD() works?

If you can point me to someone or to a relevant article, that would be very helpful. Thanks!

Well, CUDA behavior after a fork is totally undefined. It certainly doesn’t work consistently.

There’s no chance this will work in CUDA 4.0. None whatsoever.