In short, I don't think the scenario you describe will be possible, exactly as stated.
First of all, I would recommend that you familiarize yourself with the CUDA simpleIPC sample application:
[url]http://docs.nvidia.com/cuda/cuda-samples/index.html#simpleipc[/url]
Although it’s not a direct implementation of your scenario, it illustrates some important concepts. One important concept is that a CUDA context should not be established in a parent process if the GPU(s) are intended to be used in a child process. Fork the process first, then establish the CUDA context. CUDA contexts are unique to each process, and in general are not shareable amongst processes (although contexts from separate child processes can coexist, and “share” devices).
The above info should shed some light on why you are seeing errors in the child process when you have performed a CUDA operation in the parent process before the fork.
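If it helps, here's a minimal sketch of the ordering that works: fork first, and make CUDA calls only after the fork. This uses the runtime API with error handling trimmed, and the cudaFree(0) idiom just to force context creation:

[code]
#include <cuda_runtime.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* Deliberately make no CUDA calls here: the parent has no context yet. */
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: establish this process's own CUDA context after the fork. */
        cudaError_t err = cudaSetDevice(0);
        if (err == cudaSuccess)
            err = cudaFree(0);              /* forces lazy context creation */
        printf("child %d: %s\n", (int)getpid(), cudaGetErrorString(err));
        return 0;
    }
    wait(NULL);                             /* parent just waits */
    return 0;
}
[/code]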
Regarding your objective, in a nutshell, I don't believe you will be able to share a mapped pointer in any way. The host allocation and pinning of the memory can be done once, but the mapping/registration step will have to be repeated in each CUDA context in which you want to use the host pointer as a mapped device pointer.
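To illustrate what I mean, here's a sketch of the per-process step, assuming a System V shared memory segment created elsewhere (map_in_this_process is just a name I made up for illustration):

[code]
#include <cuda_runtime.h>
#include <stddef.h>
#include <sys/shm.h>

#define SEG_SIZE (1 << 20)   /* assumed size of the shared segment */

/* Hypothetical helper: every process that wants a mapped device pointer
 * must repeat the register/map step in its own CUDA context. The device
 * pointer returned here is valid only in the context that mapped it. */
void *map_in_this_process(int shmid)
{
    void *h = shmat(shmid, NULL, 0);            /* attach shared host memory */
    cudaSetDevice(0);                           /* establish this process's context */
    cudaHostRegister(h, SEG_SIZE, cudaHostRegisterMapped);  /* pin + map here */
    void *d = NULL;
    cudaHostGetDevicePointer(&d, h, 0);         /* per-context device alias */
    return d;
}
[/code]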
To be sure, I've tested a few other scenarios, such as attempting to share the mapped pointer via cudaIPC, and also attempting to directly extract a device pointer from a shmget/shmat pointer that has been registered/pinned/mapped in another process. Neither idea works. (And, pondering it, I'm not surprised.)
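For example, the cudaIPC attempt boiled down to something like this sketch; cudaIpcGetMemHandle is documented only for cudaMalloc allocations, so it errors out on the device alias of a mapped host pointer:

[code]
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    void *h = NULL, *d = NULL;
    cudaHostAlloc(&h, 1 << 20, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d, h, 0);

    /* Attempt to export the mapped device pointer to another process:
     * this fails, because IPC handles only cover cudaMalloc memory. */
    cudaIpcMemHandle_t handle;
    cudaError_t err = cudaIpcGetMemHandle(&handle, d);
    printf("cudaIpcGetMemHandle: %s\n", cudaGetErrorString(err));

    cudaFreeHost(h);
    return 0;
}
[/code]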
I presume your motivation is that you don’t want to pay the overhead of mapping/pinning, and that is the reason for your statement:
" it is not an option to call cuMemHostRegister for each process fork(2)."
Some of the overhead is mitigated by (allocating and) pinning in one process, leaving only the mapping operation to be performed in the other processes. In a quick test, however, the mapping operation was still significant: the map/pin step required 0.7s for a 1GB allocation, and the map-only step required 0.4s per process (CUDA 6.5 / RHEL 6.2).
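For reference, my quick test was along these lines (just timing the register call; the exact numbers will vary with driver and OS):

[code]
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    size_t sz = 1UL << 30;                  /* 1GB, as in the figures above */
    void *h = malloc(sz);
    if (!h) return 1;
    cudaFree(0);                            /* create the context before timing */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    cudaHostRegister(h, sz, cudaHostRegisterMapped);   /* pin + map */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("pin+map of 1GB: %.3f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);

    cudaHostUnregister(h);
    free(h);
    return 0;
}
[/code]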
If the large overhead is what you are trying to avoid, the best suggestion I can offer is to consider whether you can run your application as a multi-threaded one rather than a multi-process one. This will significantly reduce code complexity, and will also avoid the multiple-context/multiple-process overhead issues I've mentioned here.
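As a rough sketch of why: all threads share a single CUDA context, so the pin/map cost is paid exactly once and the resulting pointer is usable from every thread (worker logic elided):

[code]
#include <cuda_runtime.h>
#include <pthread.h>

void *worker(void *arg)
{
    float *d = (float *)arg;   /* same mapped device pointer in every thread */
    /* ... launch kernels that use d ... */
    (void)d;
    return NULL;
}

int main(void)
{
    float *h = NULL, *d = NULL;
    cudaHostAlloc((void **)&h, 1 << 20, cudaHostAllocMapped);  /* pin + map once */
    cudaHostGetDevicePointer((void **)&d, h, 0);

    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, d);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);

    cudaFreeHost(h);
    return 0;
}
[/code]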
I can share some of the code I used to evaluate this, if you're interested. However, it doesn't use the driver API, nor does it demonstrate a way to achieve exactly what you want, so I've omitted it for now. It's just a hacked-up version of the simpleIPC app.