Hi everyone,
I'm doing some image processing tasks on my Jetson. For this I use the OpenCV GPU HOGDescriptor with the GPU detectMultiScale function. The board is running L4T 21.3 with OpenCV4Tegra 2.4.10.1. All 4 cores are active and running constantly at max frequency, as is the GPU.
Everything looks fine: the speed boost over the CPU HOG on the Jetson as well as the detection results. Only the processing time causes some problems.
In detail, for my application the 192 CUDA cores need 100 ms for the GPU detectMultiScale call. But this duration is not constant, as one would usually expect from the HOG algorithm. It jumps around between 100 ms and 160 ms, i.e. up to 60% more.
If I run my application on an Intel i5 with a GTX 960, the GPU processing time is almost constant (max. +5%).
Does this happen on the Jetson because it has only 1 SM, so there might be much more overhead compared to the 8 SMs of the GTX 960?
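(For reference, the SM count can be queried at runtime. A minimal sketch, assuming OpenCV4Tegra exposes the standard 2.4 gpu module; on the TK1 this should print 1, on the GTX 960 it should print 8:)

#include <opencv2/gpu/gpu.hpp>
#include <iostream>

int main()
{
    // Print name and multiprocessor (SM) count of the active CUDA device
    cv::gpu::DeviceInfo info(cv::gpu::getDevice());
    std::cout << info.name() << ": " << info.multiProcessorCount() << " SM(s)" << std::endl;
    return 0;
}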
I've also disabled the L4T GUI and run the application; there is no change.
Can someone test this on their own Jetson? (A standalone benchmark sketch for reproducing it is at the end of this post.)
Here are the important code snippets and facts:
- For the tests I always used the same picture
- The source picture is 1024x414
- I use the built-in getDaimlerPeopleDetector SVM
- p_HogWinHeight = 96
- p_HogScaleLevels = 15
- p_HogThreshold = 1.5
- p_HogMultiScalefac = 1.10
//** Initialize OpenCV GPU-HOG **
hog = gpu::HOGDescriptor(Size(p_HogWinHeight / 2, p_HogWinHeight), Size(16, 16), Size(8, 8), Size(8, 8), 9, -1.0, 0.2, 1, p_HogScaleLevels);
hog.setSVMDetector(cv::HOGDescriptor::getDaimlerPeopleDetector());
...
gpuImg.upload(img_crop);
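// note: gpuImgSmall used below presumably comes from gpuImg via the preprocessing elided above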
//** OpenCV Time measurement
TickMeter tm; tm.start(); //tic
hog.detectMultiScale(gpuImgSmall, hogUnfiltered, p_HogThreshold, Size(8, 8), Size(0, 0), p_HogMultiScalefac, 2);
tm.stop(); cout << " Detector(ms): " << tm.getTimeMilli() << endl; //toc
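For anyone who wants to reproduce this, here is a standalone benchmark sketch with the same parameters as above. The file name, the image path and the run count are just placeholders; it warms up once, then repeats detectMultiScale on the same frame and prints min/mean/max, so the jitter is directly visible:

// hog_bench.cpp -- standalone timing sketch (hypothetical file name)
// build e.g.: g++ hog_bench.cpp -o hog_bench `pkg-config --cflags --libs opencv`
#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <iostream>
#include <vector>
#include <algorithm>

int main(int argc, char** argv)
{
    // Same picture for every run, as in my tests (path is a placeholder)
    cv::Mat img = cv::imread(argc > 1 ? argv[1] : "test.png");
    if (img.empty()) { std::cerr << "image not found" << std::endl; return 1; }

    // Parameters from the post
    const int winHeight = 96, scaleLevels = 15;
    cv::gpu::HOGDescriptor hog(cv::Size(winHeight / 2, winHeight), cv::Size(16, 16),
                               cv::Size(8, 8), cv::Size(8, 8), 9, -1.0, 0.2, 1, scaleLevels);
    hog.setSVMDetector(cv::HOGDescriptor::getDaimlerPeopleDetector());

    // gpu HOG wants CV_8UC1 (or CV_8UC4) input
    cv::Mat gray;
    cv::cvtColor(img, gray, CV_BGR2GRAY);
    cv::gpu::GpuMat gpuImg(gray);

    std::vector<cv::Rect> found;
    // One warm-up call so lazy CUDA initialization is not part of the measurement
    hog.detectMultiScale(gpuImg, found, 1.5, cv::Size(8, 8), cv::Size(0, 0), 1.10, 2);

    double minMs = 1e9, maxMs = 0.0, sumMs = 0.0;
    const int runs = 50;
    for (int i = 0; i < runs; ++i)
    {
        cv::TickMeter tm; tm.start();
        hog.detectMultiScale(gpuImg, found, 1.5, cv::Size(8, 8), cv::Size(0, 0), 1.10, 2);
        tm.stop();
        double ms = tm.getTimeMilli();
        minMs = std::min(minMs, ms); maxMs = std::max(maxMs, ms); sumMs += ms;
    }
    std::cout << "Detector min/mean/max (ms): " << minMs << " / " << sumMs / runs
              << " / " << maxMs << std::endl;
    return 0;
}

On my Jetson the min/max spread from this loop is the ~100 ms vs. ~160 ms described above; it would be interesting to see the numbers from other boards.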