Why exporting and importing CUDA IPC handles in the scope of the same Linux process is not supported?

Hi,

I am aware that opening back a CUDA IPC handle within the scope of the same Linux process is not supported. As a result, in the below code excerpt the call to cudaIpcOpenMemHandle fails before the assert is even reached(error checks have been removed to reduce verbosity):

void* mem_ptr;
cudaMalloc(&mem_ptr, MEM_SIZE);

// Get memory handle
cudaIpcMemHandle_t ipc_handle;
cudaIpcGetMemHandle(&ipc_handle, mem_ptr);

// Use the memory handle to access the allocated memory
void* mem_ptr_2;
cudaIpcOpenMemHandle(&mem_ptr_2, ipc_handle, cudaIpcMemLazyEnablePeerAccess);

//
assert(mem_ptr == mem_ptr_2)

It’d be interesting to understand why such limitation exists in CUDA IPC, and whether there are ways to work around it?

1 Like

Hi @dreqeu, welcome back.

I propose we move this post to the CUDA programming category, I am sure there will be someone able to answer this.

Thanks!

1 Like

since the original allocation is already available in the allocating process, there is no need to use IPC to connect to it, and indeed it is not an example of Inter-Process communication. To “work around” this, just use the allocated pointer directly in the allocating process.

Agreed!

I agree @Robert_Crovella, at first glance importing the IPC handler back into the address space of the exporting OS process may seem gratuitous. On the other hand, I see no logical reason for this pattern to not work in theory, but I bet in practice there is a motivation behind it. Maybe you can shed some light on the practical side?

You mean shed some light on what is going on under the hood? No, I don’t have any comments on that. From a programmers perspective, it works the way it works, and it is certainly possible to access an IPC-oriented allocation from the allocating process. You just don’t use the IPC handle to do it, you use the original allocated pointer.

I haven’t studied it carefully, but cuMemExportToShareableHandle() may also be something to check out. I still don’t think it gives you the symmetry you are looking for, but I haven’t tried exporting and importing a handle to the same process.

Thanks for pointing out the low-level virtual management API Robert.

The main motivation behind the apparent gratuitous pattern is I’d like to wrap device pointers in shared memory friendly shared pointers - ShmSharedPtr - to manage the lifetime of the device memory automatically across the boundaries of multiple Os processes. Once the raw device pointer is wrapped in ShmSharedPtr, the resulting object is to be passed around within the scope of the OS process that allocated the device memory and created the ShmSharedPtr instance as well as in other sibling OS processes. Ideally, ShmSharedPtr should internally hold the IPC handle only and call cudaIpcOpenMemHandle to get hold of the actual device pointer whenever the ShmSharedPtr instance is dereferenced.

Now, ShmSharedPtr could track the PID that allocated the device memory and use either the raw pointer or the IPC handle depending on the PID dereferencing the ShmSharedPtr instance, but that’s an extra layer of complexity that adds to an already complicated design and was looking for ways to simplify it.