The CUDA driver API supports exchanging GPU device memory between processes, using cuIpcGetMemHandle() and cuIpcOpenMemHandle().
This is a very valuable feature when multiple concurrent processes refer to a shared data structure, such as the background worker processes of PostgreSQL during parallel query execution.
However, it is restricted to device memory regions acquired by cuMemAlloc(); managed device memory acquired by cuMemAllocManaged() is not supported.
We need a way to share managed device memory regions across multiple processes, using either the existing IPC APIs or new ones.
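The restriction can be sketched roughly as follows, assuming the CUDA 9.2 driver API; error handling is reduced to a minimum for brevity, and the exact error code returned for managed memory is not confirmed here:

```c
/* Sketch: cuIpcGetMemHandle() works for cuMemAlloc()'d memory,
 * but not for cuMemAllocManaged()'d memory. */
#include <cuda.h>
#include <stdio.h>

int main(void)
{
    CUdevice        dev;
    CUcontext       ctx;
    CUdeviceptr     dptr;
    CUipcMemHandle  handle;
    CUresult        rc;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* works: plain device memory can be exported to other processes */
    cuMemAlloc(&dptr, 1UL << 20);
    rc = cuIpcGetMemHandle(&handle, dptr);      /* CUDA_SUCCESS */
    printf("cuMemAlloc'd memory: rc = %d\n", (int)rc);
    cuMemFree(dptr);

    /* does not work: managed memory cannot be exported */
    cuMemAllocManaged(&dptr, 1UL << 20, CU_MEM_ATTACH_GLOBAL);
    rc = cuIpcGetMemHandle(&handle, dptr);      /* returns an error */
    printf("managed memory:     rc = %d\n", (int)rc);
    cuMemFree(dptr);

    cuCtxDestroy(ctx);
    return 0;
}
```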
cuMemAlloc() immediately allocates GPU device memory backed by physical page frames.
On the other hand, cuMemAllocManaged() only reserves a unified virtual address range for GPU device memory at the time of the API call; physical page frames are then assigned on demand (on Pascal/Volta).
This demand-paging feature is very helpful (1) when an application cannot estimate the correct size of a result buffer in advance, or (2) when an application needs to handle a data set larger than the GPU's device memory.
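For case (1), a typical pattern looks like the sketch below: reserve a generous virtual range up front and let demand paging consume physical pages only for the portion actually written. The helper name alloc_result_buffer is illustrative, not part of any API:

```c
/* Sketch, assuming Pascal/Volta demand paging: a result buffer whose
 * final size is unknown before execution. cuMemAllocManaged() only
 * reserves virtual address space, so over-reserving is cheap. */
#include <cuda.h>

/* hypothetical helper: reserve up to max_bytes of managed memory */
static CUdeviceptr alloc_result_buffer(size_t max_bytes)
{
    CUdeviceptr buf;

    /* no physical page frames are consumed yet; pages migrate to
     * the GPU (or host) on first touch */
    if (cuMemAllocManaged(&buf, max_bytes,
                          CU_MEM_ATTACH_GLOBAL) != CUDA_SUCCESS)
        return 0;
    return buf;
}
```

The same property covers case (2): max_bytes may legitimately exceed the GPU's physical memory, because the driver migrates pages in and out as the working set moves.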
My application (a PostgreSQL extension for GPU acceleration: https://github.com/heterodb/pg-strom) has both kinds of workloads: (1) the number of result rows generated by JOIN/GROUP BY is not accurately predictable prior to execution, and (2) the in-memory columnar cache preloads database contents that are obviously larger than GPU device memory.
PostgreSQL is built on a multi-process model for several reasons, so its extensions (including GPU accelerators) must follow that design.
On the other hand, the lack of this IPC capability in CUDA 9.2 prevents multi-process applications from using managed device memory for shared data structures, like the inner buffer of GpuJoin. If the IPC APIs supported managed device memory regions, we could implement this kind of application/functionality in a much simpler, more efficient, and more straightforward way in a multi-process environment.
Managed device memory also provides a unified virtual address space between the host application and device memory.
This is NOT essential for the requested IPC capability. As long as demand paging is supported, it can be the programmer's role to manage the differences in virtual addresses.
Because the process that opens the IPC handle may already be using the same virtual address range, we all understand that unified addresses across multiple processes are not possible; however, demand paging is both possible and a helpful feature.
It looks to me like managed device memory is implemented using mmap(2) on /dev/nvidia_uvm.
Probably, it is possible to extend the driver so that multiple processes (mm_structs, to be more correct) share the same host-side physical page frames, as the SysV shared memory implementation does.