Sharing a pinned region between processes


I have an application that runs as a set of separate processes that communicate through an anonymous POSIX shared memory region. I have adapted the program so that each process uses a separate GPU. At the moment, I copy data from the GPU to a pinned bounce buffer and then copy it again into the shared memory region. For algorithmic reasons I won’t go into, it would be highly preferable to copy from each GPU directly into the shared region. Although I can lock the region in RAM with mlock(), this apparently isn’t enough for the CUDA runtime to treat it as pinned for cuMemcpy purposes.
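To make the setup concrete, here is a minimal sketch of the kind of shared region described above. It assumes the region is created with mmap(MAP_SHARED | MAP_ANONYMOUS) before forking, which is only one way to get an anonymous shared mapping; the function names are illustrative, and no CUDA calls are made. A plain store in the child stands in for the GPU-to-host copy.

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map an anonymous region that is shared with any child forked afterwards.
   Returns NULL on failure. (Illustrative helper, not part of any library.) */
static void *map_shared_region(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Write `value` into the shared region from a child process and read it
   back in the parent. Returns the value seen by the parent, or -1 on error. */
static int roundtrip_through_shared_region(int value)
{
    size_t bytes = (size_t)sysconf(_SC_PAGESIZE);
    int *region = map_shared_region(bytes);
    if (region == NULL)
        return -1;

    /* mlock() keeps the pages resident in RAM. It can fail when
       RLIMIT_MEMLOCK is too low; that is reported but not treated as fatal. */
    if (mlock(region, bytes) != 0)
        perror("mlock");

    pid_t pid = fork();
    if (pid < 0) {
        munmap(region, bytes);
        return -1;
    }
    if (pid == 0) {
        /* Assumption: in the real program, this is where the device-to-host
           copy would ideally target `region` directly. */
        region[0] = value;
        _exit(0);
    }

    int status = 0;
    waitpid(pid, &status, 0);
    int result = (WIFEXITED(status) && WEXITSTATUS(status) == 0)
                     ? region[0] : -1;
    munmap(region, bytes);
    return result;
}
```

Calling roundtrip_through_shared_region(42) from main() should return 42 when the mapping and fork succeed. The point is that mlock() only makes the pages resident from the kernel's point of view; nothing in this sketch tells the CUDA runtime about the region.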

Is there a way to tell the CUDA runtime that a particular virtual address range should be treated as a pinned region? It seems that the “GPUDirect” feature that Mellanox have been promoting is exactly this, but I can’t find any API documentation for it.