CUDA inter-process data transfer and cuMemcpyPeer seg fault

Hello, all!

I am working on inter-process CUDA buffer transfer: the goal is to let inter-process communication benefit from some of the new features of GPUDirect 2.0, which can eliminate data transfers through host-side main memory. In general, my approach is to create a shared CUDA context for the two processes. I can show that the shared CUDA context actually carries the information the two processes need for direct data transfer, but I ran into a segmentation fault in the cuMemcpyPeer() call. I hope I can get some help on this forum or from NVIDIA.

All of the nice features of GPUDirect 2.0 today actually operate within a single process. If two CUDA compute processes need to transfer data between them, we currently have to copy the data back to host-side main memory and do generic IPC there. What I am trying to find out is whether a CUDA direct peer copy can be done in this situation.

Since there is a virtual memory layer in CUDA, I believe that a CUDA context actually contains the information mapping a virtual address to a GPU device physical address.

So I create a CUDA context (the shared context) and a device buffer (the shared buffer) in one process, and then fork() it. At that point the address space of the parent process is the same as that of the child process, and it includes the shared context. After that, each process creates a new context for its normal CUDA computation, so it won't disturb the shared context. When the child process wants to send a device memory buffer to the parent, it pushes the shared context (making it the current context) and calls cuMemcpyPeer() to copy the data into the shared buffer in the shared context. Symmetrically, the parent copies the data back out of the shared buffer. Here cuMemcpyPeer() generates a segmentation fault in the child process, but it works fine in the parent process.
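In outline, the sequence looks roughly like the sketch below. This is not the attached file: the names (shared_ctx, shared_buf, local_ctx, local_buf, BUF_SIZE) are just illustrative, and error checking plus the handshake that makes the parent wait for the child are omitted.

#include <cuda.h>
#include <unistd.h>

#define BUF_SIZE (1 << 20)

int main()
{
    CUdevice dev;
    CUcontext shared_ctx;      /* context both processes will inherit */
    CUdeviceptr shared_buf;    /* shared device buffer */

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&shared_ctx, 0, dev);
    cuMemAlloc(&shared_buf, BUF_SIZE);
    cuCtxPopCurrent(NULL);     /* leave the shared context floating */

    pid_t pid = fork();        /* child inherits the address space,
                                  including the shared context handle */

    /* Each process creates its own context for normal CUDA work. */
    CUcontext local_ctx;
    CUdeviceptr local_buf;
    cuCtxCreate(&local_ctx, 0, dev);
    cuMemAlloc(&local_buf, BUF_SIZE);

    if (pid == 0) {
        /* Child: push the shared context so it becomes current, then
           copy the local buffer into the shared buffer.  This is the
           cuMemcpyPeer() call that segfaults for me. */
        cuCtxPushCurrent(shared_ctx);
        cuMemcpyPeer(shared_buf, shared_ctx, local_buf, local_ctx, BUF_SIZE);
        cuCtxPopCurrent(NULL);
    } else {
        /* Parent: copy the data back out of the shared buffer.
           This direction works for me. */
        cuCtxPushCurrent(shared_ctx);
        cuMemcpyPeer(local_buf, local_ctx, shared_buf, shared_ctx, BUF_SIZE);
        cuCtxPopCurrent(NULL);
    }
    return 0;
}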

Since cuMemcpyPeer() is a new function, I am not sure whether this corner case is handled by the CUDA driver. But I can show that the context information mapping the shared buffer address (a virtual memory address) to its physical address is preserved correctly in the child process after fork(): a simple cuMemcpyHtoD() (host-to-device memory copy) into the shared buffer in the child process can be read back and checked in the parent process. This means that after fork() the shared context already carries the information cuMemcpyPeer() would need to work in the child process.
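Continuing the sketch above (same illustrative names, no error checking, and the parent/child handshake elided), the verification experiment looks roughly like this:

if (pid == 0) {
    /* Child: write a known pattern into the shared buffer through the
       shared context.  This HtoD copy succeeds after fork(). */
    char *src = (char *) malloc(BUF_SIZE);
    memset(src, 0xAB, BUF_SIZE);
    cuCtxPushCurrent(shared_ctx);
    cuMemcpyHtoD(shared_buf, src, BUF_SIZE);
    cuCtxPopCurrent(NULL);
} else {
    /* Parent: after waiting for the child, read the shared buffer back
       and check that it contains the pattern the child wrote.  It does,
       which tells me the virtual-to-physical mapping survived fork(). */
    char *dst = (char *) malloc(BUF_SIZE);
    cuCtxPushCurrent(shared_ctx);
    cuMemcpyDtoH(dst, shared_buf, BUF_SIZE);
    cuCtxPopCurrent(NULL);
}

(This fragment also needs <stdlib.h> and <string.h> for malloc() and memset().)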

My environment is CentOS 5.5, Linux kernel 2.6.36.4, CUDA driver 4.0_linux_64_270.41.19, and a Tesla M2070 card.

Can anyone on this forum or from NVIDIA take a look at my case and perhaps give me some help? Thanks a lot!

Hi all,
I have attached a simple piece of code to this post to show my problem. There is a switch in the code to either 1) reproduce the seg fault, or 2) show that cuMemcpyHtoD() works with the shared context.

Thanks for taking a look!

PS. I use “-arch=sm_20 -lcuda” when compiling.

simplePeerTest.cu (9.25 KB)
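For anyone trying to reproduce this, the full compile line is simply the following (the output name is just my own choice):

nvcc -arch=sm_20 -lcuda simplePeerTest.cu -o simplePeerTest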

No part of CUDA works between processes, so this is not expected to work at all.

Hi tmurray,

Thanks for the reply! I see that the official CUDA documentation does not suggest this kind of cross-process usage; that is part of the reason I want to find out whether it is possible. You see, in the case of using

cuMemcpyHtoD(shared_dev_buf, local_host_buf, size)

in the fork()'ed child process, the cross-process usage actually works. This makes me really want to know whether I can do a similar thing in the other case, with

cuMemcpyPeer(shared_dev_buf, shared_context, local_dev_buf, local_context, size)
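(For reference, the CUDA 4.0 driver API declares this entry point as

CUresult cuMemcpyPeer(CUdeviceptr dstDevice, CUcontext dstContext,
                      CUdeviceptr srcDevice, CUcontext srcContext,
                      size_t ByteCount);

so the call above names the shared context as the destination and my per-process context as the source.)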

Do you see no possibility at all, even given that cuMemcpyHtoD() works?

If you can point me to someone or to a relevant article, that would be very helpful. Thanks!

Well, CUDA behavior after a fork is totally undefined. It certainly doesn’t work consistently.

There’s no chance this will work in CUDA 4.0. None whatsoever.