I’m writing two programs on Windows 10 that do roughly the following:
app1: takes an OpenCV GpuMat, does some computation, and saves the (very large) result in memory.
app2: reads that memory and renders it.
At the moment I transfer the data through host memory, and it is too slow (app1 device memory → app1 host memory → app2 host memory → app2 device memory).
What I want to know is whether there is a way to pass the device pointer from app1 to app2.
I know there is a way to do this on Linux with CUDA inter-process communication (IPC). I am also exploring the idea of CUDA contexts, and I found this question from 2009: https://devtalk.nvidia.com/default/topic/418234/?comment=2920332#reply
However, at that time contexts were tied to threads. That changed later to allow multi-threading, and contexts are now per device, per process: https://devtalk.nvidia.com/default/topic/519087/cuda-context-and-threading/?offset=4
I think each app in my case has its own context, since each one is a separate process. I have also read that it is not possible to pass pointers between contexts. However, the driver API (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX) has functions to pop and push contexts. That gave me the idea that app1 could pop its CUDA context, app2 could somehow push it onto its own stack, and in that way I could retrieve the data pointer created in app1 and use it in app2.
Is this possible? If so, how?
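For reference, here is a minimal sketch of the Linux CUDA IPC pattern I mentioned above. The `cudaIpcGetMemHandle` / `cudaIpcOpenMemHandle` calls are the documented runtime API, but shipping the handle over MPI and the rank roles are my own assumptions, and as I understand it this API is limited to Linux (and Windows devices in TCC mode), which is why I am not sure it applies to my Windows 10 setup:

```cpp
// Sketch only: assumes app1 is MPI rank 0 and app2 is MPI rank 1,
// and that the IPC handle is shipped over MPI like any other bytes.
#include <cuda_runtime.h>
#include <mpi.h>

// --- app1 (rank 0): export the device allocation ---
void exportBuffer(float* d_outIm) {
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, d_outIm);   // handle is a plain 64-byte struct
    MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    // d_outIm must stay allocated (and app1 alive) until app2 is done with it
}

// --- app2 (rank 1): map the same allocation into its own context ---
float* importBuffer() {
    cudaIpcMemHandle_t handle;
    MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    void* d_ptr = nullptr;
    cudaIpcOpenMemHandle(&d_ptr, handle, cudaIpcMemLazyEnablePeerAccess);
    return static_cast<float*>(d_ptr);       // valid until cudaIpcCloseMemHandle
}
```

Note that with this pattern no context is transferred at all; each process keeps its own context and only the allocation is mapped into both.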
// App1: receives a data stream via MPI into host memory as an image h_inIm;
// I also have access to ncols and nrows.
// There are other ways to create the GpuMat, e.g. wrapping h_inIm in a cv::Mat header
// and then using cv::cuda::GpuMat::upload to create inImg. However, I care about the
// CUDA context, so I use runtime API calls to make sure a context has been created.
CUresult a;
CUcontext pctx;
cudaSetDevice(0); // the runtime API initializes the device's primary context on first use
const size_t arraySz = nrows * ncols * sizeof(float);
const size_t step = ncols * sizeof(float);
float * d_inIm;
cudaMalloc((void **) &d_inIm, arraySz);
cudaMemcpy(d_inIm, h_inIm, arraySz, cudaMemcpyHostToDevice);
cv::cuda::GpuMat inImg(nrows, ncols, CV_32F, d_inIm, step); // wrap the existing device buffer
cv::cuda::GpuMat outImg(nrows, ncols, CV_32F);
//... do some operation with OpenCV on the GPU, so d_inIm becomes d_outIm
someFunc(inImg,outImg);
// Now I want to pass a pointer to outImg (i.e. outImg.data) through an MPI stream to app2.
// I do not want to go through the host to pass the data.
// My problem is that as soon as app1 finishes, outImg's pointer goes out of scope
// and I cannot pass it. I think here I can do something like:
a = cuCtxGetCurrent(&pctx);
assert(a == CUDA_SUCCESS);
a = cuCtxPopCurrent(&pctx);
// Can I send pctx through an MPI stream? If so, how would app2 acquire it?
// My intuition tells me the context will be destroyed when app1 finishes.
// Or could a = cuCtxPopCurrent(&pctx); be app1's last call,
// and then app2's first call be a = cuCtxPushCurrent(pctx); ?
I would really appreciate any help you can provide.