Shared memory between GPU and CPU

Hello,

My question is related to this https://devtalk.nvidia.com/default/topic/1064668/jetson-nano/eliminate-upload-download-for-opencv-cuda-gpumat-using-shared-memory-/?offset=8# post.

I want to eliminate the upload/download by using the unified memory of the Jetson Nano. In my case, I get an image that is stored on the GPU, as it is part of a GStreamer pipeline, and I want to access it from the CPU. To do that, I do the following:

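// Managed (unified) buffer, sized for a 4-channel image
void* unifiedPtrMat;
uint bytesMat = dest_height * dest_width * 4;
cudaMallocManaged(&unifiedPtrMat, bytesMat);

// Wrap the EGL frame that is already on the GPU (4 channels)
cv::cuda::GpuMat auxMat(dest_height, dest_width, CV_8UC4, eglFrame.frame.pPitch[0]);

// Device and host headers over the same managed buffer (3 channels)
cv::cuda::GpuMat d_mat(dest_height, dest_width, CV_8UC3, unifiedPtrMat);
cv::Mat h_mat(dest_height, dest_width, CV_8UC3, unifiedPtrMat);

// Convert on the GPU, then copy the result into the managed buffer
cuda::cvtColor(auxMat, auxMat, COLOR_BGRA2RGB, 0);
auxMat.copyTo(d_mat);

// Read the managed buffer from the CPU to save the frame
imwrite("../../../DeepStream_Results/frame_"+to_string(dsexample->detectionsTracker[idx].getCountFrames())+".jpg", h_mat);

cudaFree(unifiedPtrMat);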

The program runs for several frames (the number is different on each run) and then crashes with: Bus error (core dumped).

Hi,

A bus error usually occurs when a buffer is accessed by different processes at the same time.

Based on your use case, we recommend checking whether a buffer is used by two adjacent frames at the same time.
The new frame should wait until the previous frame’s task has finished before accessing the buffer.
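
One way to check for overlapping use is a small debugging guard around each buffer. The following is only a host-side sketch under assumptions: BufferLease and overlap_goes_undetected are hypothetical names, not part of CUDA or OpenCV, and the guard only detects overlap between pieces of code that take a lease. It does not see GPU work by itself.

```cpp
#include <atomic>
#include <stdexcept>

// Hypothetical debugging aid: marks a shared buffer as in use for one frame.
class BufferLease {
public:
    explicit BufferLease(std::atomic<bool>& in_use) : in_use_(in_use) {
        // exchange() returns the previous value; true means another frame
        // still holds the buffer.
        if (in_use_.exchange(true)) {
            throw std::logic_error("buffer is still in use by a previous frame");
        }
    }
    // Releasing the lease marks the buffer as free for the next frame.
    ~BufferLease() { in_use_.store(false); }

    BufferLease(const BufferLease&) = delete;
    BufferLease& operator=(const BufferLease&) = delete;

private:
    std::atomic<bool>& in_use_;
};

// Returns true if a second lease could be taken while the first one is alive
// (overlap not detected), false if the guard caught the overlap.
bool overlap_goes_undetected(std::atomic<bool>& in_use) {
    BufferLease first(in_use);
    try {
        BufferLease second(in_use);
        return true;
    } catch (const std::logic_error&) {
        return false;
    }
}
```

In a per-frame callback, a lease would be taken before the first access to the buffer and released only after all work on that frame has finished.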

Thanks.

And how can I check that? I call cudaFree(0) before entering this block of code, and the other cudaFree comes after saving the image. Could it be that the first cudaFree(0) frees all the allocated memory, so that it fails on the current frame?

Hi,

Would you mind sharing the complete sample code with us so we can check it for you?

Thanks.

Hi,

I solved the core dump error by synchronizing the host and the device, calling this function before //some code …:

cudaDeviceSynchronize

However, the code runs 50% slower when the image is copied into the allocated buffer with copyTo than when it is not, and the frame accessed from the host is only used for saving. The imwrite call was commented out for this comparison.
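
For reference, a comparison like this can be made with a small timing helper. This is a minimal sketch, not the code used for the numbers above: mean_ms_per_call is a hypothetical name, and `work` stands for one frame's processing. For GPU code, the callable is assumed to end with cudaDeviceSynchronize() so that asynchronous work is included in the measurement.

```cpp
#include <chrono>
#include <functional>

// Average wall-clock time per call, in milliseconds, over `iterations` runs.
// `work` is a hypothetical placeholder for one frame's processing.
double mean_ms_per_call(const std::function<void()>& work, int iterations) {
    using clock = std::chrono::steady_clock;

    work();  // warm-up run, excluded from the measurement

    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        work();
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / iterations;
}
```

Calling it once with the copyTo variant and once without it, with the same number of iterations, gives two averages that can be compared directly.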

Hi,

Would you mind sharing your source with us so we can check it further for you?
Thanks.

Hi,

Is everything okay?
Please let us know if there is still an issue with the OpenCV memory.

Thanks.

Hello,

Sorry for my late answer. I was busy with another project these last weeks.

For the problem I mentioned above, I didn’t find any alternative solution, as the images that I get through the pipeline are already stored on the GPU and the only way to access them from the CPU is by copying or downloading the image. However, that was not the only problem I found.

In this https://mirrobots-my.sharepoint.com/:f:/g/personal/jgc_mir-robots_com/EpZaRKdm26pLstbnHzyalekBFyhTslLZDOhfa0CvfIxLjw?e=ib8h1D link I share a reduced example of what I am trying to do. For some detected objects, I want to extract their features using the corner detector and then use these features to keep track of the objects until the next detection.

The problem, or limitation, I found is that I want the output of the corner detector and of the optical flow to be in shared memory, so that I don’t need to download/upload the data several times in every iteration. To do this, I allocate the memory before the corner detector and before the optical flow. However, the output of both is not as expected, because both change the size of the GpuMat. In the case of the corner detector, if I try to access the data from the CPU I read zeros, while if I download it the data is non-zero. In the case of the optical flow, when I read from the CPU I get the expected values. However, when I allocate the memory beforehand, the optical flow sometimes performs worse than without the allocation.

In the code, there are comments explaining what you can comment out to try the different variants. Let me know if you have problems compiling it. The main file is tracking_test.cpp, which contains a comment on how to compile the project.

In the code, you will also see why it is important for me that the GpuMat and the Mat output by the optical flow function have the same size.

Thanks.

Hi,

I will check your code soon and get back to you with more information.
Thanks.

Hi,

I cannot open this link:
https://mirrobots-my.sharepoint.com/:f:/g/personal/jgc_mir-robots_com/EpZaRKdm26pLstbnHzyalekBFyhTslLZDOhfa0CvfIxLjw?e=ib8h1D

Could you check it?
Thanks.

Hi,

Try this link:
https://mirrobots-my.sharepoint.com/:f:/g/personal/jgc_mir-robots_com/EpZaRKdm26pLstbnHzyalekB3A0jhiqZBmDY-92oplr6jw?e=BbaBCT

Hi,

Not sure if I understood this correctly.

It’s hard to know how many matches will be found before calling the algorithm.
So how do you pre-allocate the buffer for it?

Just check the source of OpenCV.
The buffer may be reallocated if its capacity is not enough:

Maybe you can get more information from the source directly.
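
As an illustration of the reallocation point, here is a simplified model of a matrix header that wraps external memory. It is not OpenCV’s code: ModelMat and detect are hypothetical names, and the only thing it mimics is the rule that create() keeps the existing storage when the requested size already matches and allocates new storage otherwise. Whether a given OpenCV CUDA function behaves this way should be confirmed in its source.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified model of a matrix header that can wrap external memory.
// This is NOT OpenCV's code; it only mimics the reallocate-on-mismatch rule.
struct ModelMat {
    int rows = 0;
    int cols = 0;
    std::uint8_t* data = nullptr;     // current storage (external or owned)
    std::vector<std::uint8_t> owned;  // storage allocated by create()

    ModelMat(int r, int c, std::uint8_t* external)
        : rows(r), cols(c), data(external) {}

    // Keeps the current storage only when the requested size already matches;
    // otherwise allocates new storage and stops pointing at the external one.
    void create(int r, int c) {
        if (r == rows && c == cols && data != nullptr) return;
        owned.assign(static_cast<std::size_t>(r) * static_cast<std::size_t>(c), 0);
        rows = r;
        cols = c;
        data = owned.data();
    }
};

// Hypothetical algorithm: sizes its output to the number of results it found,
// then writes one value per result.
void detect(ModelMat& out, int found) {
    out.create(1, found);
    for (int i = 0; i < found; ++i) {
        out.data[i] = 255;
    }
}
```

In this model, an output wrapped around a shared buffer with a guessed size ends up pointing at new storage after detect(), and the shared buffer keeps its old contents. That would be consistent with reading zeros from the CPU side while a download returns real data.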
Thanks.