I tried to use the OpenCV CUDA functions with a stream parameter to get the CPU and GPU executing in parallel.
I wrote a test program and ran it on a PC and on a Tegra TX2, with the same results. The test program is attached: OpencvTest.cpp (2.0 KB)
These are the test results:
**No Stream Time in msec 2**
**Stream Opencv Time in msec 7**
**Stream Complete Time in msec 7**
**Stream Opencv Time in msec 5**
**Stream Complete Time in msec 5**
**Stream Opencv Time in msec 6**
**Stream Complete Time in msec 6**
**Stream Opencv Time in msec 5**
**Stream Complete Time in msec 5**
I expected “Stream Opencv Time” to be close to zero. Instead, the CPU was blocked until the OpenCV CUDA functions had finished their work.
Another question: why does the non-stream version take less time than the stream version?
I’m not looking for asynchronous memory copies. I’m looking for asynchronous kernel execution.
Even when I change the code as follows:
GpuSrc.upload(CpuSrc, MyStream);
unsigned long long StartTime = GetTimeMiliSec();
cuda::warpAffine(GpuSrc, GpuDest, WarpTForm, GpuSrc.size(), cv::INTER_CUBIC, 0, 0, MyStream);
unsigned long long MiddleTime = GetTimeMiliSec();
GpuDest.download(CpuDest, MyStream);
MyStream.waitForCompletion();
unsigned long long EndTime = GetTimeMiliSec();
I got approximately the same result. I expected MiddleTime to be a few microseconds after StartTime, not 5 msec later.
It looks like the OpenCV function blocks the calling thread until it completes.
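To make the expectation concrete, here is a minimal sketch of the Start/Middle/End timing pattern that does not use OpenCV or CUDA at all. It is only an analogy: `std::async` stands in for the stream, and `fakeKernel`, `measureLaunch` and `Timing` are hypothetical names made up for this illustration, not part of any library.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Times in milliseconds, measured from just before the launch call.
struct Timing {
    long long middle_ms;  // when control returned to the caller
    long long end_ms;     // when the work had actually finished
};

// Hypothetical stand-in for a GPU kernel: keeps a worker busy for work_ms.
static void fakeKernel(int work_ms) {
    std::this_thread::sleep_for(std::chrono::milliseconds(work_ms));
}

// Hypothetical helper mirroring the Start/Middle/End timing in my test.
// blocking == true models the behaviour I observe,
// blocking == false models the behaviour I expected from a stream.
Timing measureLaunch(int work_ms, bool blocking) {
    using Clock = std::chrono::steady_clock;
    auto elapsedMs = [](Clock::time_point from) {
        return static_cast<long long>(
            std::chrono::duration_cast<std::chrono::milliseconds>(
                Clock::now() - from).count());
    };

    const Clock::time_point start = Clock::now();
    std::future<void> done =
        std::async(std::launch::async, fakeKernel, work_ms);
    if (blocking) {
        done.wait();  // the launch call itself waits for the work
    }
    const long long middle = elapsedMs(start);  // corresponds to MiddleTime
    done.wait();                                // corresponds to waitForCompletion()
    const long long end = elapsedMs(start);     // corresponds to EndTime
    return Timing{middle, end};
}
```

With `measureLaunch(200, false)`, `middle_ms` should be close to zero while `end_ms` is about 200. With `measureLaunch(200, true)`, both are about 200, which matches what my OpenCV test shows.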