Using OpenCV cuda stream for parallel CPU and GPU execution

Alex66 · November 24, 2022, 8:05am

I tried to use OpenCV CUDA functions with a stream parameter to run CPU and GPU work in parallel.
I wrote a test program and ran it on a PC and on a Tegra TX2, with the same results. The test program is attached:
OpencvTest.cpp (2.0 KB)
These are the test results:

**No Stream Time in msec 2**

**Stream Opencv Time in msec 7**
**Stream Complete Time in msec 7**
**Stream Opencv Time in msec 5**
**Stream Complete Time in msec 5**
**Stream Opencv Time in msec 6**
**Stream Complete Time in msec 6**
**Stream Opencv Time in msec 5**
**Stream Complete Time in msec 5**

I expected “Stream Opencv Time” to be close to zero. Instead, the CPU was blocked until the OpenCV CUDA functions had finished their work.
Another question: why does the non-stream version take less time than the stream version?

Cheloup · December 1, 2022, 9:49am

According to this article:

To enable asynchronous copies your destination memory must be pinned.

You can add

cv::cuda::registerPageLocked(CpuDest);
cv::cuda::registerPageLocked(CpuSrc);

to your code to speed up copies between host and device and to enable asynchronous copies.

BTW: don’t forget to call cv::cuda::unregisterPageLocked on both buffers before the program exits.
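One way to make sure the unregister call is never forgotten is a small scope guard. The sketch below is only an illustration: `ScopeExit` and `demoCleanupCalls` are hypothetical names, not part of OpenCV, and a counter stands in for the real `cv::cuda::unregisterPageLocked` call so the snippet builds without OpenCV.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical helper (not an OpenCV class): runs the given callable
// when the guard goes out of scope, including on early return.
class ScopeExit {
public:
    explicit ScopeExit(std::function<void()> onExit)
        : onExit_(std::move(onExit)) {}
    ~ScopeExit() {
        if (onExit_) onExit_();
    }
    ScopeExit(const ScopeExit&) = delete;
    ScopeExit& operator=(const ScopeExit&) = delete;

private:
    std::function<void()> onExit_;
};

// Demo: a counter stands in for the unregister call.
int demoCleanupCalls() {
    int calls = 0;
    {
        ScopeExit guard([&calls] { ++calls; });
        assert(calls == 0);  // nothing runs while the guard is alive
    }
    return calls;  // the cleanup ran when the inner scope closed
}

// Assumed usage with OpenCV (CpuSrc is the cv::Mat from the thread):
//   cv::cuda::registerPageLocked(CpuSrc);
//   ScopeExit unpinSrc([&] { cv::cuda::unregisterPageLocked(CpuSrc); });
```

The guard must be declared after the buffer it refers to, so that it is destroyed (and the memory unregistered) before the buffer itself is released.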

Alex66 · December 1, 2022, 3:37pm

I’m not looking for async memory copy. I’m looking for async kernel execution.
Even when I change the code as follows:

    GpuSrc.upload(CpuSrc, MyStream);
    unsigned long long StartTime = GetTimeMiliSec();
    cuda::warpAffine(GpuSrc, GpuDest, WarpTForm, GpuSrc.size(), cv::INTER_CUBIC, 0, 0, MyStream);
    unsigned long long MiddleTime = GetTimeMiliSec();
    GpuDest.download(CpuDest, MyStream);
    MyStream.waitForCompletion();
    unsigned long long EndTime = GetTimeMiliSec();

I got approximately the same result. I expected MiddleTime to be a few microseconds after StartTime, not 5 msec later.
It looks like the OpenCV function blocks the calling thread until it completes.
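To tell "a few microseconds" apart from "a few milliseconds", the enqueue time and the total time have to be measured separately and at microsecond resolution. The following is a minimal sketch of that measurement pattern only. `FakeStream`, `Timing`, `elapsedUs` and `measureFakeStream` are hypothetical names, and `FakeStream` is a CPU-only stand-in that simply defers the work until `waitForCompletion()`. It does not reproduce how `cv::cuda::Stream` behaves and makes no claim about where OpenCV blocks.

```cpp
#include <chrono>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a stream: enqueue() only records the work,
// waitForCompletion() runs everything that was recorded.
class FakeStream {
public:
    void enqueue(std::function<void()> work) {
        pending_.push_back(std::move(work));
    }
    void waitForCompletion() {
        for (auto& work : pending_) work();
        pending_.clear();
    }

private:
    std::vector<std::function<void()>> pending_;
};

struct Timing {
    long long enqueueUs;  // analogous to MiddleTime - StartTime
    long long totalUs;    // analogous to EndTime - StartTime
};

// Microseconds elapsed since `since`, using a monotonic clock.
long long elapsedUs(std::chrono::steady_clock::time_point since) {
    return std::chrono::duration_cast<std::chrono::microseconds>(
               std::chrono::steady_clock::now() - since)
        .count();
}

Timing measureFakeStream() {
    FakeStream stream;
    const auto start = std::chrono::steady_clock::now();

    // A 20 ms sleep stands in for the GPU work.
    stream.enqueue([] {
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
    });
    const long long enqueueUs = elapsedUs(start);

    stream.waitForCompletion();
    const long long totalUs = elapsedUs(start);

    return Timing{enqueueUs, totalUs};
}
```

With a call that is really asynchronous, the enqueue time stays far below the total time. If the two values are nearly equal, as in the results reported above, the enqueueing call itself is blocking.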

Cheloup · December 1, 2022, 5:04pm

So the only remaining explanation is that there may be a synchronization inside the OpenCV code.

Try calling a custom kernel instead of the OpenCV function, or check it with NVIDIA Nsight Systems.
