Share GPU/host pinned memory between host processes

I see that CUDA 4.1 will add the ability for peer-to-peer communication across SEPARATE host processes. If that is possible, then surely it must also be possible to share GPU device memory and/or host-pinned memory across multiple processes using the same GPU. Is this possible in CUDA 4.0? If not, is it possible in CUDA 4.1? What is the mechanism? Would one use P2P across processes to share memory on the same device? My situation is that I have one Quadro GPU and multiple host processes. Currently, to share memory between processes, I have to copy the data out to a host shared-memory buffer in one process (using an mmap'ed buffer) and then copy it back to the card for the other process to use. I do not want to combine the two processes into one. If CUDA 4.1 can do P2P across multiple processes and multiple devices, it must be possible to do this with multiple processes and a single device. Any ideas? Thanks in advance for your time on this…

Yes, the cudaIpc* functions in 4.1 (not 4.0) appear to allow you to take a device pointer and create a “memory handle”. Unlike the device pointer, the memory handle is portable to other processes, where it can be turned back into a device pointer valid in that process. You can also use a similar set of functions to make a CUDA event in one process available to another process. Check the CUDA Toolkit Reference Manual for more details.
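
For the event half of the API, the sequence looks roughly like this. This is a minimal sketch, untested against the RC, and the transport of the handle between processes (mmap, pipe, socket, etc.) is left out:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaEvent_t ev;
    cudaIpcEventHandle_t evHandle;

    /* Interprocess events must be created with both of these flags. */
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming | cudaEventInterprocess);

    /* Export a portable handle; ship evHandle to the other process
       over any host-side IPC channel. */
    cudaError_t err = cudaIpcGetEventHandle(&evHandle, ev);
    if (err != cudaSuccess)
        printf("export failed: %s\n", cudaGetErrorString(err));

    /* In the receiving process:
           cudaEvent_t ev;
           cudaIpcOpenEventHandle(&ev, evHandle); // handle arrived via IPC
           cudaEventSynchronize(ev);              // wait on the other process's work
    */

    cudaEventDestroy(ev);
    return 0;
}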

seibert, thanks very much for the reply. For anyone interested: I was able to allocate device memory (using cudaMalloc) in one process (A) and then use cudaIpcGetMemHandle to get the handle for that memory. I then used a host-side IPC mechanism to share the handle (in my case, an mmap'ed shared buffer whose only purpose is to pass this handle between processes). In the other process (B), I used cudaIpcOpenMemHandle to translate the handle back into a device memory pointer, and process B was then able to use process A's allocated memory.

One note: the CUDA 4.1 (RC2) reference manual lists only two arguments for cudaIpcOpenMemHandle. That gives a compiler error, because three arguments are actually required; according to the header file, the last one must be cudaIpcMemLazyEnablePeerAccess. (Maybe the final CUDA release will remove the requirement to pass this argument, as this is the ONLY value it can take.)
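
Here is a condensed sketch of that flow as a single program (run once with "produce", then with "consume" from a second shell). The file name /tmp/cuda_ipc_handle and the buffer size are arbitrary choices of mine, error checking is trimmed, and real code would need synchronization so that B never reads the handle before A has written it:

#include <cuda_runtime.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    /* Shared file whose only job is to carry the handle between processes. */
    int fd = open("/tmp/cuda_ipc_handle", O_RDWR | O_CREAT, 0666);
    ftruncate(fd, sizeof(cudaIpcMemHandle_t));
    void *shared = mmap(NULL, sizeof(cudaIpcMemHandle_t),
                        PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    if (argc > 1 && strcmp(argv[1], "produce") == 0) {
        /* Process A: allocate device memory and export its handle. */
        float *d_x;
        cudaMalloc((void **)&d_x, 100 * sizeof(float));
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, d_x);
        memcpy(shared, &handle, sizeof(handle));
        pause();                        /* keep the allocation alive for B */
    } else {
        /* Process B: read the handle back and open it as a device pointer. */
        cudaIpcMemHandle_t handle;
        memcpy(&handle, shared, sizeof(handle));
        float *d_x;
        cudaIpcOpenMemHandle((void **)&d_x, handle,
                             cudaIpcMemLazyEnablePeerAccess);
        cudaMemset(d_x, 0, 100 * sizeof(float)); /* touch A's memory */
        cudaIpcCloseMemHandle(d_x);     /* unmap in B; A still owns it */
    }
    return 0;
}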

NVIDIA guys–if you read this, thank you for this functionality!!!

Awesome. Glad to see someone trying this feature out!

Out of curiosity: Do you cudaFree the device memory from process B, or do you cudaFree it from process A once process B signals that it is done with it?

#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int nsize = 100;
    float *d_x;
    cudaIpcMemHandle_t *handle = NULL;

    cudaMalloc((void **)&d_x, nsize*sizeof(float));

    cudaError_t err;
    err = cudaIpcGetMemHandle(handle, d_x);
    if(err != cudaSuccess){
        printf("unsuccessful\n");
        fflush(stdout);
    }

    cudaFree(d_x);
    return 0;
}

This code gives me a seg fault in cudaIpcGetMemHandle. Any idea how to debug this?

The crash comes from passing a null pointer: cudaIpcGetMemHandle writes the handle through its first argument, so that argument must point at real storage. Declare the handle by value and pass its address:

cudaIpcMemHandle_t handle;

cudaMalloc((void **)&d_x, nsize*sizeof(float));

cudaError_t err;
err = cudaIpcGetMemHandle(&handle, d_x);
if(err != cudaSuccess){
    printf("unsuccessful\n");
    fflush(stdout);
}