I’m writing two programs on Windows 10 that do something like this:
app1: takes an OpenCV GpuMat, does some computation, and saves the huge amount of data into memory.
app2: reads the data from memory and renders it.
I’m currently using host memory to transmit this huge amount of data, and it is too slow (app1 device memory -> app1 host memory -> app2 host memory -> app2 device memory).
What I want to know is whether there is a way to pass the device pointer from app1 to app2.
I know there is a way to do this on Linux with CUDA inter-process communication (IPC). I am also exploring the idea of CUDA contexts. I found this question asked in 2009: https://devtalk.nvidia.com/default/topic/418234/?comment=2920332#reply
However, at that time contexts were tied to threads. That changed later to allow multithreading, and contexts are now per device, per process: https://devtalk.nvidia.com/default/topic/519087/cuda-context-and-threading/?offset=4
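For reference, my understanding is that the Linux IPC route mentioned above does not pass the raw device pointer at all; it exports an opaque `cudaIpcMemHandle_t` that the second process reopens. A minimal sketch of the two sides (the function names are my own; as far as I know, the `cudaIpc*` calls are not available under the Windows WDDM driver model):

```cpp
#include <cuda_runtime.h>

// App1 side: export an opaque, copyable handle for a device allocation.
cudaIpcMemHandle_t exportDeviceBuffer(float *d_buf) {
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, d_buf);  // handle is a small POD struct
    return handle;                        // its bytes can be sent over any IPC channel
}

// App2 side: map the same allocation into this process's address space.
// The allocation stays valid only while app1 keeps it alive.
float *importDeviceBuffer(const cudaIpcMemHandle_t &handle) {
    void *d_buf = nullptr;
    cudaIpcOpenMemHandle(&d_buf, handle, cudaIpcMemLazyEnablePeerAccess);
    return static_cast<float *>(d_buf);   // release later with cudaIpcCloseMemHandle
}
```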
I think each app in my case has its own context, since each one is a separate process. I have also read that it is not possible to pass pointers between contexts. However, reading the driver API (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX), there are functions to pop and push contexts. This gave me the idea that app1 could pop its CUDA context, app2 could somehow push it onto its own stack, and that way I could retrieve the data pointer created in app1 and use it in app2.
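As far as I can tell, though, those calls are meant to move a context between threads of one process; the `CUcontext` value is an opaque handle into that process's driver state. My understanding of the intended pattern:

```cpp
#include <cuda.h>

// Thread A of a process detaches the current context from its stack...
CUcontext ctx = nullptr;
cuCtxPopCurrent(&ctx);

// ...and thread B of the *same* process makes it current again.
// Note that cuCtxPushCurrent takes the CUcontext by value, not a pointer to it.
cuCtxPushCurrent(ctx);
```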
Is this possible? If so, how?
```cpp
// App1: the data arrives via MPI into host memory as an image h_inIm;
// I also have access to ncols and nrows.
// There are other ways to create the GpuMat, e.g. wrapping h_inIm in a cv::Mat
// header and calling cv::cuda::GpuMat::upload to create inImg. However, I want
// the CUDA context, so I use the CUDA runtime calls to make sure one is created.
CUresult a;
CUcontext pctx;

cudaSetDevice(0);  // the runtime API creates a context here

const size_t arraySz = nrows * ncols * sizeof(float);
const size_t step = ncols * sizeof(float);
float *d_inIm;
cudaMalloc((void **)&d_inIm, arraySz);
cudaMemcpy(d_inIm, h_inIm, arraySz, cudaMemcpyHostToDevice);

cv::cuda::GpuMat inImg(nrows, ncols, CV_32F, d_inIm, step);
cv::cuda::GpuMat outImg(nrows, ncols, CV_32F);

// ... do some operation with OpenCV on the GPU, so d_inIm becomes d_outIm
someFunc(inImg, outImg);

// Now I want to pass a pointer to outImg's data (outImg.data) through an MPI
// stream to app2. I do not want to go through the host to pass the data.
// My problem is that as soon as app1 finishes, outImg's pointer goes out of
// scope and I am not able to pass it. I think here I can do something like:
a = cuCtxGetCurrent(&pctx);
assert(a == CUDA_SUCCESS);
a = cuCtxPopCurrent(&pctx);
// Can I send pctx through a stream using MPI? If so, how would app2 acquire it?
// My intuition tells me the context will get destroyed when app1 finishes.
// Or can I make a = cuCtxPopCurrent(&pctx); the last call in app1, and then
// have app2 call a = cuCtxPushCurrent(pctx); ?
```
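If an opaque-handle mechanism were usable here, I think the MPI part itself would be simple, since `cudaIpcMemHandle_t` is a fixed-size struct that can be sent as raw bytes (a sketch of what I have in mind; the ranks and tag are hypothetical):

```cpp
#include <cuda_runtime.h>
#include <mpi.h>

// app1 (rank 0): export a handle for outImg's device buffer and send its bytes.
void sendHandle(void *d_data) {
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, d_data);
    MPI_Send(&handle, sizeof(handle), MPI_BYTE, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
}

// app2 (rank 1): receive the bytes and reopen the allocation in this process.
void *recvHandle() {
    cudaIpcMemHandle_t handle;
    MPI_Recv(&handle, sizeof(handle), MPI_BYTE, /*src=*/0, /*tag=*/0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    void *d_data = nullptr;
    cudaIpcOpenMemHandle(&d_data, handle, cudaIpcMemLazyEnablePeerAccess);
    return d_data;
}
```

This still requires app1 to keep the allocation (and itself) alive while app2 uses it, which is also why I suspect sending `pctx` would not work.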
I would really appreciate any help you can provide.