I’m developing an application that needs to copy very small chunks of data (1 kB) received from a NIC to the GPU. Latency is absolutely critical in this application, so rather than batching packets into a reasonably sized buffer (e.g. 1 MB) for a single cudaMemcpy, I really must copy 1024 bytes at a time.
The GDRcopy library (https://github.com/NVIDIA/gdrcopy) is a great match for this workload. The GDRcopy example application works just fine, and if I implement something similar in my application it works great.
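For reference, the pattern that works for me when the buffer is allocated in my own process looks roughly like this (a sketch with error handling omitted; the function name and packet size are mine):

```cpp
// Sketch of the working GDRcopy flow when the buffer is cuMemAlloc'd
// locally in the same process (error handling omitted for brevity).
#include <cuda.h>
#include <gdrapi.h>

void copy_packet_local(const void *packet, size_t len /* 1024 */) {
    CUdeviceptr d_buf;
    cuMemAlloc(&d_buf, len);

    // Mark the allocation for synchronous memory operations, as
    // required before pinning it for GDRcopy. This succeeds here.
    unsigned int flag = 1;
    cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, d_buf);

    gdr_t g = gdr_open();
    gdr_mh_t mh;
    gdr_pin_buffer(g, d_buf, len, 0, 0, &mh);  // also succeeds here

    void *bar_ptr;
    gdr_map(g, mh, &bar_ptr, len);

    // Low-latency CPU-to-GPU copy of one packet via the BAR1 mapping.
    gdr_copy_to_mapping(mh, bar_ptr, packet, len);

    gdr_unmap(g, mh, bar_ptr, len);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
    cuMemFree(d_buf);
}
```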
The “catch” is that the GPU memory I wish to write to is allocated in a separate GPU context/process; the owning process exports it with cudaIpcGetMemHandle(), and I open the handle in my process with cudaIpcOpenMemHandle(). What I can’t seem to accomplish is getting that device memory, allocated in another context, correctly configured for use via GDRcopy in my application. The necessary call to cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, dev_ptr) fails with CUDA_ERROR_INVALID_VALUE, and the call to gdr_pin_buffer() fails as well.
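Concretely, the failing sequence looks like this (again a sketch; how the IPC handle is transported between the processes is omitted, and the function name is mine):

```cpp
// Sketch of the failing path: the buffer was allocated in another
// process, which exported it with cudaIpcGetMemHandle() and sent the
// handle to this process (transport omitted).
#include <cuda.h>
#include <cuda_runtime.h>
#include <gdrapi.h>
#include <cstdio>

void map_remote_buffer(const cudaIpcMemHandle_t &handle, size_t len) {
    void *d_ptr = nullptr;
    cudaIpcOpenMemHandle(&d_ptr, handle, cudaIpcMemLazyEnablePeerAccess);

    // This is the call that fails with CUDA_ERROR_INVALID_VALUE:
    unsigned int flag = 1;
    CUresult rc = cuPointerSetAttribute(
        &flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, (CUdeviceptr)d_ptr);
    if (rc != CUDA_SUCCESS)
        fprintf(stderr, "cuPointerSetAttribute failed: %d\n", rc);

    // ...and gdr_pin_buffer() on the same pointer fails too:
    gdr_t g = gdr_open();
    gdr_mh_t mh;
    int ret = gdr_pin_buffer(g, (CUdeviceptr)d_ptr, len, 0, 0, &mh);
    if (ret)
        fprintf(stderr, "gdr_pin_buffer failed: %d\n", ret);
}
```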
I suspect I’m missing a key step that would allow the memory allocated in the other GPU context to be mapped. I would have thought this was a common use case for GPUDirect RDMA-based applications.
Any advice on what to try would be greatly appreciated.