Using OpenCV CUDA streams for parallel CPU and GPU execution

I tried to use the OpenCV CUDA functions with the stream parameter to run CPU and GPU work in parallel.
I wrote a test program and ran it on a PC and on a Tegra TX2, with the same results. The test program is attached:
OpencvTest.cpp (2.0 KB)
These are the test results:

**No Stream Time in msec 2**

**Stream Opencv Time in msec 7**
**Stream Complete Time in msec 7**
**Stream Opencv Time in msec 5**
**Stream Complete Time in msec 5**
**Stream Opencv Time in msec 6**
**Stream Complete Time in msec 6**
**Stream Opencv Time in msec 5**
**Stream Complete Time in msec 5**

I expected “Stream Opencv Time” to be close to zero. Instead, I saw that the CPU was blocked until the OpenCV CUDA functions had finished their work.
Another issue: why does the non-stream version take less time than the stream version?

According to this article:

To enable asynchronous copies your destination memory must be pinned.

You can add a call in your code that pins the host memory, both to accelerate copies between host and device and to enable asynchronous copies.

BTW: Don’t forget to call cv::cuda::unregisterPageLocked before leaving the program.

I’m not looking for async memory copy. I’m looking for async kernel execution.
Even when I change the code as follows:

    GpuSrc.upload(CpuSrc, MyStream);
    unsigned long long StartTime = GetTimeMiliSec();
    cuda::warpAffine(GpuSrc, GpuDest, WarpTForm, GpuSrc.size(), cv::INTER_CUBIC, 0, 0, MyStream);
    unsigned long long MiddleTime = GetTimeMiliSec();
    ..., MyStream);
    unsigned long long EndTime = GetTimeMiliSec();

I got approximately the same result. I expected MiddleTime to be a few microseconds after StartTime, not 5 msec later.
It looks like the OpenCV function blocks the calling thread until it completes.
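
Since the expectation here is a gap of a few microseconds, it may help to time the call with a microsecond-resolution clock. The implementation of `GetTimeMiliSec` is not shown in the thread, and judging by its name it returns whole milliseconds. The following is a sketch of a drop-in counterpart built on `std::chrono::steady_clock`; the name `GetTimeMicroSec` is an assumption, not part of the original test program.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical counterpart to GetTimeMiliSec(): microseconds elapsed on a
// monotonic clock, measured from the first call made in this process.
inline std::int64_t GetTimeMicroSec()
{
    using Clock = std::chrono::steady_clock;
    static const Clock::time_point origin = Clock::now();  // fixed on first call
    return std::chrono::duration_cast<std::chrono::microseconds>(
               Clock::now() - origin).count();
}
```

Wrapping only the `cuda::warpAffine` call between two `GetTimeMicroSec()` readings would show whether it returns after a few microseconds or after several thousand, which is the difference between an asynchronous launch and a blocking one.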

So the only remaining explanation is that there may be a synchronization inside the OpenCV code.

Try calling a custom kernel instead of the OpenCV function, or check it with NVIDIA Nsight Systems.