How to share a device pointer between two processes in Windows 10

I’m writing two programs on Windows 10 that do the following:

app1: take an OpenCV GpuMat, do some computation, and save the large result into memory.

app2: read from that memory and render it.

I’m currently routing the data through host memory to transmit it, and it is too slow (app1 device memory -> app1 host memory -> app2 host memory -> app2 device memory).

What I want to know is whether there is a way to pass the device pointer from app1 to app2.

I know there is a way to do it on Linux with inter-process communication (IPC). I have been exploring the idea of CUDA contexts, and I found this question from 2009: https://devtalk.nvidia.com/default/topic/418234/?comment=2920332#reply
At that time contexts were tied to a single thread. That changed later to support multi-threading, and contexts are now per device, per process: https://devtalk.nvidia.com/default/topic/519087/cuda-context-and-threading/?offset=4
I think each app in my case has its own context, since each one is a separate process. I have also read that it is not possible to pass pointers between contexts. However, the driver API (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX) has functions to push and pop contexts, which gave me the idea that app1 could pop its CUDA context, app2 could somehow push it onto its own stack, and in that way I could retrieve the data pointer created in app1 and use it in app2.
Is this possible? If so, how?
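For reference, the Linux mechanism mentioned above is the CUDA IPC API (`cudaIpcGetMemHandle` / `cudaIpcOpenMemHandle`). A minimal sketch of how the exporting and importing sides would look, assuming a platform where CUDA IPC is supported (Linux, or a Windows GPU in TCC mode; it is not available under the Windows WDDM driver model). The function names `exportBuffer`/`importBuffer` and the transport (MPI) are illustrative:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// --- In app1: export a device allocation ---
void exportBuffer(size_t bytes) {
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, bytes);

    cudaIpcMemHandle_t handle;                // opaque, fixed-size handle
    cudaIpcGetMemHandle(&handle, d_buf);

    // Send `handle` (plain bytes) to app2 via MPI, a pipe, shared memory, ...
    // app1 must keep d_buf allocated until app2 is done with it.
}

// --- In app2: import the allocation into its own context ---
void importBuffer(const cudaIpcMemHandle_t &handle) {
    float *d_buf = nullptr;
    cudaIpcOpenMemHandle(reinterpret_cast<void **>(&d_buf), handle,
                         cudaIpcMemLazyEnablePeerAccess);
    // d_buf is now a valid device pointer in app2's own context;
    // e.g. it could be wrapped in a cv::cuda::GpuMat for rendering.
    cudaIpcCloseMemHandle(d_buf);
}
```

Note that this shares the *allocation*, not the context: each process keeps its own context, and the driver maps the same device memory into both.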

// App1: receives a data stream via MPI into host (CPU) memory as an image h_inIm;
// I also have access to ncols and nrows.
// There are other ways to create the GpuMat, for example creating a cv::Mat header
// for h_inIm and then using upload() to fill inImg. However I want the CUDA context,
// so I use CUDA runtime calls to make sure a context has been created.


CUresult a;
CUcontext pctx;
cudaSetDevice(0); // the runtime API creates (or attaches to) the primary context here

const size_t arraySz = nrows * ncols * sizeof(float);
const size_t step    = ncols * sizeof(float);
float *d_inIm;
cudaMalloc((void **)&d_inIm, arraySz);
cudaMemcpy(d_inIm, h_inIm, arraySz, cudaMemcpyHostToDevice);
cv::cuda::GpuMat inImg(nrows, ncols, CV_32F, d_inIm, step);
cv::cuda::GpuMat outImg(nrows, ncols, CV_32F);
// ... do some operation with OpenCV on the GPU, so d_inIm becomes d_outIm
someFunc(inImg, outImg);

// Now I want to pass a pointer to outImg (that is, outImg.data) in a stream through MPI to app2.
// I do not want to go through the host to pass the data.
// My problem is that as soon as app1 finishes, outImg's pointer will go out of scope
// and I am not able to pass it. I think here I can do something like:

a = cuCtxGetCurrent(&pctx);
assert(a == CUDA_SUCCESS);
a = cuCtxPopCurrent(&pctx);

// Can I send pctx through a stream using MPI? If so, how will app2 acquire it?
// My intuition tells me it will get destroyed when app1 finishes.
// Or can I make a = cuCtxPopCurrent(&pctx); the last call in app1,
// and then have app2 call a = cuCtxPushCurrent(pctx); ?

I would really appreciate any help you can provide.

Different processes occupy separate memory spaces - the address space is local to each process, and this applies to CUDA device allocations as well. It is a fundamental security feature of modern multitasking operating systems.

You can share a CUDA context among the threads belonging to the same process, but this is impossible across process boundaries. Anything else would create security problems that would make Meltdown and Spectre look harmless in comparison ;)
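To illustrate the intra-process case that *is* allowed: two threads of one process can bind the same driver-API context with `cuCtxSetCurrent`, and a device pointer allocated in that context is valid on both threads. A minimal sketch (error checking omitted for brevity):

```cpp
#include <cuda.h>
#include <thread>

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);   // created and made current on this thread

    CUdeviceptr d_buf;
    cuMemAlloc(&d_buf, 1024);    // valid only within ctx

    std::thread worker([&] {
        // A sibling thread in the SAME process can bind the same context
        // and use the same device pointer. Across processes, no equivalent
        // of this exists: pctx is meaningless in another address space.
        cuCtxSetCurrent(ctx);
        cuMemsetD8(d_buf, 0, 1024);
    });
    worker.join();

    cuMemFree(d_buf);
    cuCtxDestroy(ctx);
    return 0;
}
```

This is why popping a context in app1 and pushing it in app2 cannot work: the `CUcontext` handle only has meaning inside the process that created it.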

Christian

Cross link:

https://stackoverflow.com/questions/50513983/share-a-gpu-pointer-between-processes-in-windows-10-cuda-9