GDRCopy to memory allocated in another context

I’m developing an application that needs to copy very small chunks of data (1 kB) received by a NIC to the GPU. Latency is absolutely critical in this application, so rather than batching up a reasonable number of packets (e.g. 1 MB) for a single cudaMemcpy, I really must copy 1024 bytes at a time.

The GDRCopy library (https://github.com/NVIDIA/gdrcopy) is a great match for this workload. The GDRCopy example application works just fine, and if I implement something similar in my application it works great.
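For reference, this is roughly the single-process flow that works for me (a minimal sketch with error handling removed; it assumes a CUDA context is already current and that the 1024-byte payload fits in one GPU page):

```c
#include <cuda.h>
#include <gdrapi.h>

#define COPY_SIZE 1024  /* one NIC packet */

/* Sketch: allocate, pin and map a buffer, then do a CPU-driven copy.
   Error handling omitted; assumes a CUDA context is already current. */
int gdrcopy_local_example(const void *nic_payload)
{
    CUdeviceptr d_buf;
    cuMemAlloc(&d_buf, GPU_PAGE_SIZE);   /* GPU_PAGE_SIZE comes from gdrapi.h */

    unsigned int flag = 1;               /* required for GPUDirect RDMA access */
    cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, d_buf);

    gdr_t g = gdr_open();
    gdr_mh_t mh;
    gdr_pin_buffer(g, (unsigned long)d_buf, GPU_PAGE_SIZE, 0, 0, &mh);

    void *map_ptr = NULL;
    gdr_map(g, mh, &map_ptr, GPU_PAGE_SIZE);

    /* low-latency CPU write straight into GPU memory */
    gdr_copy_to_mapping(mh, map_ptr, nic_payload, COPY_SIZE);

    gdr_unmap(g, mh, map_ptr, GPU_PAGE_SIZE);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
    cuMemFree(d_buf);
    return 0;
}
```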

The catch is that the GPU memory I wish to write to is allocated in a separate GPU context/process, and I acquire it via the cudaIpcGetMemHandle()/cudaIpcOpenMemHandle() pair. What I can’t seem to accomplish is getting that imported device memory correctly configured for use with GDRCopy in my application. The necessary call to cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, dev_ptr) fails with CUDA_ERROR_INVALID_VALUE, and the subsequent call to gdr_pin_buffer() fails as well.
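Concretely, this is the sequence that fails on my side (sketch only; how the IPC handle gets from the owning process to mine is omitted):

```c
#include <cuda.h>
#include <cuda_runtime.h>
#include <gdrapi.h>

/* Sketch of the failing path: the IPC handle arrives from the owning
   process (transport not shown); I then try to prepare the imported
   pointer for GDRCopy. Error checks trimmed except where it fails. */
int import_and_pin(cudaIpcMemHandle_t handle, size_t size)
{
    void *d_ptr = NULL;
    cudaIpcOpenMemHandle(&d_ptr, handle, cudaIpcMemLazyEnablePeerAccess);

    unsigned int flag = 1;
    CUresult st = cuPointerSetAttribute(&flag,
                                        CU_POINTER_ATTRIBUTE_SYNC_MEMOPS,
                                        (CUdeviceptr)d_ptr);
    /* st == CUDA_ERROR_INVALID_VALUE */

    gdr_t g = gdr_open();
    gdr_mh_t mh;
    int rc = gdr_pin_buffer(g, (unsigned long)d_ptr, size, 0, 0, &mh);
    /* rc is non-zero as well */
    gdr_close(g);
    return rc;
}
```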

I suspect I’m missing a key step that would allow the memory allocated in the other GPU context to be pinned and mapped. I would have thought this would be a common use case for GPUDirect-based RDMA applications.

Any advice on what to try would be greatly appreciated.

Sorry, the GPUDirect RDMA kernel-mode APIs do not support the creation of DMA mappings for memory imported through IPC.
GDRCopy is built on those APIs, so it inherits the same limitation.
I suggest redesigning your app so that the buffer is allocated (and exported) in the same process that uses GDRCopy.
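In other words, flip the direction of the IPC export. A rough sketch, assuming you can move the allocation into the process that runs GDRCopy and pass the IPC handle to the other one:

```c
#include <cuda.h>
#include <cuda_runtime.h>
#include <gdrapi.h>

/* Sketch: the GDRCopy process owns the allocation, pins and maps it,
   and only exports the IPC handle to the consumer process. */
void allocate_and_export(void)
{
    CUdeviceptr d_buf;
    cuMemAlloc(&d_buf, GPU_PAGE_SIZE);

    unsigned int flag = 1;
    cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, d_buf);

    gdr_t g = gdr_open();
    gdr_mh_t mh;
    gdr_pin_buffer(g, (unsigned long)d_buf, GPU_PAGE_SIZE, 0, 0, &mh);
    void *map_ptr = NULL;
    gdr_map(g, mh, &map_ptr, GPU_PAGE_SIZE);

    /* export the locally allocated buffer; the other process opens it
       with cudaIpcOpenMemHandle() and launches its kernels on it */
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, (void *)d_buf);
    /* ... send `handle` to the other process over your existing channel */
}
```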