We are currently developing an application in which a parent process shares data with its child processes through OS shared memory (not CUDA shared memory). Each child process allocates CUDA pinned memory and copies the data shared by the parent into it for subsequent kernel launches. Is there a way to create the OS-level shared memory directly as CUDA pinned memory? That would eliminate the copy from OS shared memory to pinned memory, which might improve performance.
Would cudaHostRegister work? See the CUDA Runtime API section of the CUDA Toolkit Documentation.
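In case it helps, here is a minimal sketch of that idea, not a tested solution for your setup. It assumes the shared memory is a POSIX object created with shm_open and mapped with mmap; the helper names map_shared, pin_region and release_region, and any object name you pass in, are made up for illustration. The cudaHostRegister call is only compiled when building with nvcc; built with a plain C++ compiler, pin_region does nothing and returns false. As far as I know the registration applies only to the process that makes the call, so each child would register its own mapping.

```cpp
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#ifdef __CUDACC__
#include <cuda_runtime.h>
#endif

// Map (creating it if needed) a POSIX shared-memory object of `bytes` bytes.
// `name` is a hypothetical object name such as "/my_shm". Returns nullptr on failure.
void* map_shared(const char* name, std::size_t bytes) {
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, static_cast<off_t>(bytes)) != 0) {
        close(fd);
        return nullptr;
    }
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    return p == MAP_FAILED ? nullptr : p;
}

// Page-lock an existing mapping in the calling process so that it can be
// used like pinned memory, without a separate pinned buffer and copy.
bool pin_region(void* p, std::size_t bytes) {
#ifdef __CUDACC__
    cudaError_t err = cudaHostRegister(p, bytes, cudaHostRegisterDefault);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaHostRegister: %s\n", cudaGetErrorString(err));
        return false;
    }
    return true;
#else
    (void)p;
    (void)bytes;
    return false;  // built without nvcc: nothing is pinned
#endif
}

// Undo the registration (if any), unmap the region and remove the object.
void release_region(void* p, std::size_t bytes, const char* name, bool pinned) {
#ifdef __CUDACC__
    if (pinned) cudaHostUnregister(p);
#else
    (void)pinned;
#endif
    munmap(p, bytes);
    shm_unlink(name);
}
```

In a child you would call map_shared with the same name and size the parent used, then pin_region on the returned pointer, and pass that pointer straight to cudaMemcpyAsync. In a real application only one process should call shm_unlink, so the cleanup in release_region would need to be split up. Whether this ends up faster than the extra copy is something you would have to measure, since registering a large region has its own cost.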