The upload and download speeds in OpenCV CUDA are significantly different

Hello everyone,

I recently ran into an issue while using the OpenCV cv::cuda classes.

I upload 4 images of size 1948x1096 (CV_8UC3) to the GPU for operations such as remap and binarization, and end up with 12 images of size 1948x1096 (CV_8UC1) that are downloaded back to the CPU for further processing.

If I’m not mistaken, the amount of data transferred should be the same for uploading and downloading (4 images x 3 channels versus 12 images x 1 channel).

However, there is a huge difference in the speed of these two operations, with uploading taking about 4-5 ms and downloading taking about 30 ms.
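The numbers in the question can be checked with a little arithmetic. The sketch below assumes the image size is 1948 x 1096 pixels with 8-bit channels; the helper names `totalBytes` and `throughputGBps` are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>

// Hypothetical helper: total payload in bytes for `count` images of
// width x height pixels with `channels` 8-bit channels each
// (CV_8UC3 -> 3 channels, CV_8UC1 -> 1 channel).
constexpr std::size_t totalBytes(std::size_t count, std::size_t width,
                                 std::size_t height, std::size_t channels)
{
    return count * width * height * channels;
}

// Hypothetical helper: effective throughput in GB/s (1 GB = 1e9 bytes)
// for a transfer of `bytes` bytes that takes `milliseconds` ms.
constexpr double throughputGBps(std::size_t bytes, double milliseconds)
{
    return static_cast<double>(bytes) / (milliseconds * 1e-3) / 1e9;
}
```

Under that size assumption, `totalBytes(4, 1948, 1096, 3)` and `totalBytes(12, 1948, 1096, 1)` both come to 25,620,096 bytes, so the payload really is the same in both directions. An upload of about 4.5 ms then corresponds to roughly 5.7 GB/s, while a 30 ms download corresponds to roughly 0.85 GB/s.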

Did I do something wrong?

Any suggestions?

Thanks in advance.

Here is part of my code:

img_0.upload(tmp_0 , stream_0);
img_1.upload(tmp_1 , stream_1);
img_2.upload(tmp_2 , stream_2);
img_3.upload(tmp_3 , stream_3);

/*
Do some image processing
*/

mask_blue_0.download(imgThresholded_blue_0 , stream_download_blue_0 );
mask_blue_1.download(imgThresholded_blue_1 , stream_download_blue_1 );
mask_blue_2.download(imgThresholded_blue_2 , stream_download_blue_2 );
mask_blue_3.download(imgThresholded_blue_3 , stream_download_blue_3 );

mask_pink_0.download(imgThresholded_pink_0 , stream_download_pink_0 );
mask_pink_1.download(imgThresholded_pink_1 , stream_download_pink_1 );
mask_pink_2.download(imgThresholded_pink_2 , stream_download_pink_2 );
mask_pink_3.download(imgThresholded_pink_3 , stream_download_pink_3 );

mask_green_0.download(imgThresholded_green_0 , stream_download_green_0 );
mask_green_1.download(imgThresholded_green_1 , stream_download_green_1 );
mask_green_2.download(imgThresholded_green_2 , stream_download_green_2 );
mask_green_3.download(imgThresholded_green_3 , stream_download_green_3 );

If I understand correctly, you are uploading 4 three-channel images but downloading 12 single-channel images, so the download involves three times as many separate transfers.

Also note that Jetson platforms have an integrated GPU (iGPU) that shares memory with the system CPUs. Cache behavior and coherency may differ depending on your platform, but on a single platform such as AGX Xavier you may also try allocating buffers as pinned memory (uncached, useful for buffer I/O transfers) or unified memory (cached, useful for processing), then use OpenCV GpuMat with those pre-allocated buffers and see whether it helps.


Thank you for your suggestion, it gave me an idea to solve the problem.

I used the cuda::HostMem class as the download destination, which solved the slow download problem.

Below is my modified code:

cuda::HostMem Result_host_0(tmp_0 , cuda::HostMem::SHARED);

/*
Do some image processing
*/

result_0.download(Result_host_0 , stream_download_0);

imgResult_0 = Result_host_0.createMatHeader();
