Sharing CUDA device pointers between host-side processes: implementation

Hey guys,

I strongly believe that there is a way to implement device pointer sharing between host-side processes on Windows or any other OS, and I'm surprised that no implementations or examples have been published yet.

I checked the Linux/Unix-based example and understand that this functionality is implemented by Driver API calls wrapped in CUDA API calls.

I found out that I can pin host shared memory (e.g. allocated via the Boost library) that is used by several host processes, and then transfer the pinned data between host and device in the standard way:

// hostPtr was previously allocated in shared memory, e.g. using Boost
cudaHostRegister(hostPtr, hostPtrSize, cudaHostRegisterDefault);

cudaMalloc((void **) &devicePtr, hostPtrSize);

cudaMemcpy(devicePtr, hostPtr, hostPtrSize, cudaMemcpyHostToDevice);

cudaHostUnregister(hostPtr);

As a result, I can store my data in host shared memory and invoke the data transfer calls (using the same shared-memory IPC), avoiding unnecessary data transfer latency between host processes.
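
To make the idea concrete, here is a minimal sketch of the register / allocate / copy / unregister flow with every return code checked. It assumes Boost.Interprocess provides the shared segment; the helper names uploadFromShared and runDemo, the segment name "cuda_shared_buf" and the 1 MiB size are placeholders made up for illustration. Whether cudaHostRegister accepts this kind of mapping on a given OS and driver is exactly the open question, so treat this as something to experiment with rather than a confirmed solution.

```cuda
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>
#include <boost/interprocess/shared_memory_object.hpp>
#include <boost/interprocess/mapped_region.hpp>

namespace bip = boost::interprocess;

// Hypothetical helper: pins an already mapped host buffer, copies it into a
// freshly allocated device buffer, then unpins the host buffer again.
// On success *devicePtr holds the device allocation (release it with cudaFree);
// on any failure *devicePtr is set to nullptr.
cudaError_t uploadFromShared(void *hostPtr, std::size_t bytes, void **devicePtr)
{
    *devicePtr = nullptr;

    // Page-lock the shared region so the copy can use it as pinned memory.
    cudaError_t err = cudaHostRegister(hostPtr, bytes, cudaHostRegisterDefault);
    if (err != cudaSuccess) return err;

    err = cudaMalloc(devicePtr, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(*devicePtr, hostPtr, bytes, cudaMemcpyHostToDevice);

    // Always unpin; keep the first error that occurred.
    cudaError_t unregErr = cudaHostUnregister(hostPtr);
    if (err == cudaSuccess) err = unregErr;

    if (err != cudaSuccess && *devicePtr != nullptr) {
        cudaFree(*devicePtr);
        *devicePtr = nullptr;
    }
    return err;
}

// Hypothetical demo: creates a named shared segment, fills it with a test
// pattern and uploads it. Other processes would open the same segment name.
cudaError_t runDemo()
{
    const char *name = "cuda_shared_buf";   // placeholder segment name
    const std::size_t bytes = 1 << 20;      // placeholder size: 1 MiB
    cudaError_t err;
    {
        bip::shared_memory_object shm(bip::open_or_create, name, bip::read_write);
        shm.truncate(bytes);
        bip::mapped_region region(shm, bip::read_write);
        std::memset(region.get_address(), 0x2A, region.get_size());

        void *devicePtr = nullptr;
        err = uploadFromShared(region.get_address(), region.get_size(), &devicePtr);
        if (err == cudaSuccess)
            cudaFree(devicePtr);
        else
            std::fprintf(stderr, "upload failed: %s\n", cudaGetErrorString(err));
    }
    // The mapping is released at the end of the scope above; now drop the segment.
    bip::shared_memory_object::remove(name);
    return err;
}
```

Calling runDemo() from main and checking for cudaSuccess is enough to try it out. The device buffer still belongs to the process that called cudaMalloc; only the host-side data is shared between processes in this scheme.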

So, before I go and implement this mechanism, I want to clarify what issues I might run into when pinning memory shared between host processes as Driver-managed memory (page locks, page size limitations, etc.). Any ideas or suggestions on what to check?