I want to remove the upload/download steps by using the unified memory of the Jetson Nano. In my case, I get an image that is stored on the GPU as part of a GStreamer pipeline, and I want to access it from the CPU. To do that, I do the following:
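A minimal sketch of this unified-memory pattern on Jetson (the sizes, the single-channel format, and the variable names here are illustrative, not taken from the actual pipeline code):

```cpp
#include <cuda_runtime.h>
#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>

int main() {
    // Example dimensions -- replace with the real frame size from the pipeline.
    const int rows = 720, cols = 1280;
    const size_t step = cols * sizeof(uchar);  // single channel, tightly packed

    // One managed allocation, visible from both the CPU and the GPU.
    void* unified = nullptr;
    cudaMallocManaged(&unified, rows * step);

    // Two headers over the SAME memory: no copy, no download.
    cv::cuda::GpuMat d_img(rows, cols, CV_8UC1, unified, step);  // device view
    cv::Mat          h_img(rows, cols, CV_8UC1, unified, step);  // host view

    // ... launch CUDA work that writes d_img ...

    // The GPU must finish before the CPU touches the managed buffer,
    // otherwise a bus error / segfault is possible on Jetson.
    cudaDeviceSynchronize();

    // h_img can now be read directly (e.g. passed to cv::imwrite).

    cudaFree(unified);
    return 0;
}
```

Note that neither header owns the memory, so the `cudaFree` must come only after both views are done with it.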
A bus error usually occurs when a buffer is accessed by different processes simultaneously.
Based on your use case, it's recommended to check whether a buffer is used by two adjacent frames at the same time.
The new frame should wait until the task of the previous frame has finished before accessing the buffer.
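One way to enforce this ordering is with a CUDA event recorded after the previous frame's work; the names below (`prev_done`, `stream`) are placeholders for whatever the actual pipeline uses:

```cpp
#include <cuda_runtime.h>

// Created once, reused across frames.
cudaEvent_t prev_done;
cudaEventCreate(&prev_done);

// After launching the previous frame's kernels on `stream`:
cudaEventRecord(prev_done, stream);

// Before the next frame reuses the same buffer, block the CPU
// until the recorded GPU work has completed:
cudaEventSynchronize(prev_done);
```

`cudaDeviceSynchronize()` achieves the same thing more bluntly, but an event only waits for the work recorded on it rather than for the whole device.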
And how can I check that? I do a cudaFree(0) before entering this block of code, and another one after saving the image. Could it be that the first cudaFree(0) deletes all the allocated memory, so it fails on the current frame?
I solved the core dumped error by synchronizing the host and the device with this function before //some code …:
cudaDeviceSynchronize()
However, the code runs 50% slower when the image is allocated via copyTo than when it is not allocated, even though the frame accessed from the host is only used to save it. The imwrite call was commented out when making this comparison.
Sorry for my late answer. I was busy with another project for the last few weeks.
For the problem I mentioned above, I didn't find any alternative solution: the images I get through the pipeline are already stored on the GPU, and the only way to access them from the CPU is by copying or downloading the image. However, that was not the only problem I found.
The problem, or limitation, I found is that I want the output of the corner detector and of the optical flow to be in shared memory, so that I don't need to download/upload the data several times every iteration. To do this, I allocate the memory before the corner detector and before the optical flow. However, the output of both is not what I expect, because both functions modify the size of the GpuMat. In the case of the corner detector, if I try to access the data from the CPU I read zeros, while if I download it the data is non-zero; in the case of the optical flow, when I read from the CPU I do get the expected values. However, sometimes when I run the optical flow with the memory allocated beforehand, its performance is worse than without preallocating.
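A possible explanation for the zeros: if the detector's output size or type does not match the preallocated GpuMat, OpenCV reallocates it internally, and the new buffer is no longer the one the CPU header points at. A hedged sketch using cv::cuda::HostMem with the SHARED allocation type (zero-copy on Jetson); `maxCorners`, `detector`, and `d_gray` are assumed names, and the detector is assumed to be one created with cv::cuda::createGoodFeaturesToTrackDetector:

```cpp
#include <cuda_runtime.h>
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaimgproc.hpp>

// Shared (mapped) host allocation: CPU and GPU headers over the same memory.
// goodFeaturesToTrack outputs a 1 x N CV_32FC2 row of corner coordinates.
cv::cuda::HostMem shared(1, maxCorners, CV_32FC2, cv::cuda::HostMem::SHARED);
cv::cuda::GpuMat d_corners = shared.createGpuMatHeader();  // device view
cv::Mat          h_corners = shared.createMatHeader();     // host view

detector->detect(d_gray, d_corners);
cudaDeviceSynchronize();  // GPU must finish before the CPU reads h_corners

// Caution: if detect() finds fewer corners than maxCorners, it may
// reallocate d_corners to the smaller size, detaching it from the shared
// buffer -- in that case h_corners keeps showing the old (zero) contents,
// which would match the behavior described above.
```

If that reallocation is indeed what happens, checking whether `d_corners.data` still equals the shared pointer after `detect()` would confirm it.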
In the code there are comments explaining what you can comment out to try the different variants. Let me know if you have problems compiling it. The main file is tracking_test.cpp, which contains a comment on how to compile the project.
In the code you will also see why it is important for me that the GpuMat and the Mat from the optical flow function have the same output size.