cudaIpcGetMemHandle with mapped/pinned memory

I am trying to use cudaIpcGetMemHandle with mapped/pinned memory on a Jetson TX1, but I keep getting an invalid argument error.

Code (only pertinent lines):

cudaIpcMemHandle_t handle;
int size = 100;
void* ptr = NULL, *ptrDev = NULL;
cudaHostAlloc(&ptr, size, cudaHostAllocMapped);
cudaHostGetDevicePointer(&ptrDev, ptr, 0);
...
cudaIpcGetMemHandle(&handle, ptrDev);

The above code results in “invalid argument” error. However, the following code works:

cudaIpcMemHandle_t handle;
int size = 100;
void* ptr = NULL, *ptrDev = NULL;
cudaMalloc(&ptrDev, size);
...
cudaIpcGetMemHandle(&handle, ptrDev);

So, the problem is specifically with mapped/pinned memory. Is it not possible to use the Ipc functions with mapped/pinned memory or am I doing something wrong?

Any help is appreciated!

The documentation:

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g8a37f7dfafaca652391d0758b3667539

says:

"Takes a pointer to the base of an existing device memory allocation created with cudaMalloc and exports it for use in another process. "

It seems fairly evident that excludes allocations created with cudaHostAlloc or the like.

You should be able to use ordinary linux-based IPC mechanisms to communicate host memory (not pinned) from one process to another, for use in host code. google is probably your friend there. If you were wanting to communicate device-mapped memory for use in device code in another process, I think that is probably not supported via cuda IPC registration.

Thanks for pointing out the “created with cudaMalloc” portion - I obviously missed that.

On the linux-based IPC mechanisms, it’s unclear to me how I can share memory allocated by cudaHostAlloc(…, cudaHostAllocMapped) using, for example, mmap. While mmap takes an address as a parameter, it is only used as a hint and AFAIK, there is no guarantee that the address passed in (which would be the one retrieved by cudaHostAlloc) would be the one returned by mmap.

I suppose one possibility would be to use the address returned by mmap and then register it with cuda using cudaHostRegister(). I’ll give this a shot and report back.

Trying to register the output of mmap with cudaHostRegister(…, cudaHostRegisterMapped) does not work. This results in an “operation not permitted” error.

So I guess I’m back to the drawing board in terms of how to share mapped memory between 2 processes and avoid any unnecessary memcpy’s.

Yes, I was thinking of mmap

mmap has a parameter MAP_FIXED that converts the hint into a demand.

But I agree that it may not work, as cudaHostAlloc already does a mmap with MAP_FIXED, and so I think a second mmap request might fail for that reason (although maybe worth a try).

So based on that I have no further ideas, and indeed it may not be possible to use a region returned by cudaHostAlloc as a linux IPC vehicle.

I am also trying to this same thing. Is there any new updates in CUDA library which might add support for this (cudaIPC for cudaHostAlloc() for two different kernels in two different processes to use the same host pinned memory)?

I have been able to share the host pinned memory between processes. Just need a mechanism for the kernels also be able to access the memory.

I thought you said you had it working here:

https://devtalk.nvidia.com/default/topic/1037324/cuda-programming-and-performance/how-to-share-cuda-memory-between-two-processes-/post/5270122/#5270122

Okay. I found the solution.

The first thing that needs to be done is to use shm_open() and mmap() to share host pinned memory between different processes. Then in each process, you need to call cuMemHostRegister() and then cudaHostGetDevicePointer(). This should do the trick.

I think this is not very clear in the CUDA documentation and requires bit of trial and error to get right.

@txbob:
It was working but I was assuming that the device addresses are same in different processes (as I was using MPS).

That was working fine, but then to debug something I tried using cuda-memcheck and it started complaining because of the assumptions that I was making.

So I decided to drop the assumption and use cudaIpcGetMemHandle() and cudaIpcOpenMemHandle() (as you suggested in first place). This is when I realized these fucntions don’t work with host pinned memory. Anyways, I was luckily able to figure out the right thing to do.

Thanks for the suggestion.