I am working on inter-process CUDA buffer transfer: the goal is to let inter-process communication benefit from some of the new features of CUDA Direct 2.0, which eliminate data transfers through host-side main memory. In general, my approach is to create a shared CUDA context for two processes. I can show that the shared context actually carries the information the two processes need for direct data transfer, but I hit a segmentation fault in the cuMemcpyPeer() call. I hope I can get some help on this forum or from Nvidia.
All the good features of CUDA Direct 2.0 currently work only within a single process. If two CUDA computing processes need to exchange data today, they have to copy it back to host-side main memory and use generic IPC there. What I am trying to find out is whether a CUDA direct peer copy can be done in this situation instead.
Since there is a virtual memory layer in CUDA, I believe a CUDA context contains the mapping from virtual addresses to GPU device physical addresses.
So I created a CUDA context (the shared context) and a device buffer (the shared buffer) in a process, and then called fork(). At this point the address space of the parent process is the same as that of the child process, which includes the shared context. After that, each process creates a new context of its own for normal CUDA computation, so it won't mess up the shared context. When the child process wants to send a device memory buffer to the parent, it pushes the shared context (to make it the current context) and calls cuMemcpyPeer() to copy the data into the shared buffer owned by the shared context. Symmetrically, the parent copies the data out of the shared buffer. Here cuMemcpyPeer() generates a segmentation fault in the child process, but it works fine in the parent process.
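To make the setup concrete, here is a minimal sketch of what I am doing with the CUDA driver API. Error checking and inter-process synchronization are omitted, and the buffer size and variable names are just placeholders, not my actual code:

```c
/* Sketch of the fork()-based setup described above (CUDA driver API).
 * Error checking and synchronization between the two processes are
 * omitted; the 1 MB size is an arbitrary placeholder. */
#include <cuda.h>
#include <unistd.h>

int main(void)
{
    CUdevice dev;
    CUcontext sharedCtx, workCtx;
    CUdeviceptr sharedBuf;

    cuInit(0);
    cuDeviceGet(&dev, 0);

    /* 1. Create the shared context and the shared device buffer
     *    before forking, so both processes inherit them. */
    cuCtxCreate(&sharedCtx, 0, dev);
    cuMemAlloc(&sharedBuf, 1 << 20);
    cuCtxPopCurrent(NULL);            /* leave it off the context stack */

    pid_t pid = fork();

    /* 2. Each process creates its own context for normal computation,
     *    so the shared context is not disturbed. */
    cuCtxCreate(&workCtx, 0, dev);    /* becomes the current context */
    CUdeviceptr localBuf;
    cuMemAlloc(&localBuf, 1 << 20);

    if (pid == 0) {
        /* 3. Child: push the shared context and copy from its own
         *    context into the shared buffer.
         *    This cuMemcpyPeer() call is where I get the segfault. */
        cuCtxPushCurrent(sharedCtx);
        cuMemcpyPeer(sharedBuf, sharedCtx, localBuf, workCtx, 1 << 20);
        cuCtxPopCurrent(NULL);
    } else {
        /* 4. Parent: symmetrically copy the data out of the shared
         *    buffer (this direction works for me). */
        cuCtxPushCurrent(sharedCtx);
        cuMemcpyPeer(localBuf, workCtx, sharedBuf, sharedCtx, 1 << 20);
        cuCtxPopCurrent(NULL);
    }
    return 0;
}
```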
Since cuMemcpyPeer() is a new function, I am not sure whether this corner case is handled in the CUDA driver. But I can show that the context information mapping the shared buffer's address (a virtual memory address) to its physical address is correctly kept in the child process after fork(): a simple cuMemcpyHtoD() (host-to-device memory copy) into the shared buffer performed in the child process can be read back and verified in the parent process. This suggests that the shared context after fork() carries the information cuMemcpyPeer() needs to work in the child process.
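The check I ran looks roughly like the following, continuing from the setup above (again a sketch: error checking and the mechanism that makes the parent wait for the child, such as a pipe, are omitted, and the test value is arbitrary):

```c
/* Sketch of the verification described above: the child writes into the
 * shared buffer through the inherited shared context, and the parent
 * reads the value back through the same context. */
if (pid == 0) {
    int value = 42;                               /* arbitrary test pattern */
    cuCtxPushCurrent(sharedCtx);
    cuMemcpyHtoD(sharedBuf, &value, sizeof value); /* this works in the child */
    cuCtxPopCurrent(NULL);
} else {
    int readBack = 0;
    /* ...wait here until the child has finished its copy... */
    cuCtxPushCurrent(sharedCtx);
    cuMemcpyDtoH(&readBack, sharedBuf, sizeof readBack);
    cuCtxPopCurrent(NULL);
    /* readBack now holds the child's value, which is why I believe the
     * virtual-to-physical mapping in the shared context survives fork(). */
}
```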
My environment is CentOS 5.5, Linux kernel 126.96.36.199, CUDA driver 4.0_linux_64_270.41.19, and a Tesla M2070 card.
Can anyone from this forum or from Nvidia take a look at my case and perhaps give me some help? Thanks a lot!