Using OpenCV GPU HOG with Jetson TK1

Hi everyone,

I'm doing some image processing tasks with my Jetson. For this I use the OpenCV GPU HOGDescriptor with its GPU detectMultiScale function. The board is running L4T 21.3 with OpenCV4Tegra. All 4 cores are active and running constantly at max. frequency, as is the GPU.

Everything looks fine: the speed-up over the CPU HOG on the Jetson is good, and so are the detection results. Only the processing time is causing problems.

In detail, for my application the 192 CUDA cores need 100 ms for the GPU detectMultiScale function. But this duration is not constant, as we would usually expect from the HOG algorithm. It jumps around between 100 ms and 160 ms, so up to 60% more.

If I run my application on an Intel i5 with a GTX 960, the GPU processing time is almost constant, max. +5%.

Does this happen on the Jetson because it has only 1 SM block, so that there might be much more overhead compared to the 8 SMs of the GTX 960?

I've also disabled the L4T GUI and run the application again; there is no change.

Can someone test it on their own Jetson?

Here are the important code snippets and facts:

  • For the tests I always used the same picture
  • The source picture is 1024x414
  • I use the built-in getDaimlerPeopleDetector SVM
  • p_HogWinHeight = 96
  • p_HogScaleLevels = 15
  • p_HogThreshold = 1.5
  • p_HogMultiScalefac = 1.10
//** Initialize OpenCV GPU-HOG **
hog = gpu::HOGDescriptor(Size(p_HogWinHeight / 2, p_HogWinHeight), Size(16, 16), Size(8, 8), Size(8, 8), 9, -1.0, 0.2, 1, p_HogScaleLevels);

//** OpenCV time measurement **
TickMeter tm; tm.start(); //tic

hog.detectMultiScale(gpuImgSmall, hogUnfiltered, p_HogThreshold, Size(8, 8), Size(0, 0), p_HogMultiScalefac, 2);

tm.stop(); cout << " Detector(ms): " << tm.getTimeMilli() << endl; //toc

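To put numbers on the jitter described above, a minimal sketch of a timing helper follows. It is plain C++11 with no OpenCV dependency, and the names TimingStats, summarize and timeRuns are hypothetical helpers, not part of OpenCV. It assumes the timed call only returns once the GPU work has finished.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Summary of repeated timings, all values in milliseconds.
struct TimingStats {
    double minMs;
    double maxMs;
    double meanMs;
    double spreadPct;  // (max - min) / min * 100
};

// Reduce per-call durations to min / max / mean / spread.
// Returns all zeros for an empty input.
inline TimingStats summarize(const std::vector<double>& samplesMs) {
    TimingStats s = {0.0, 0.0, 0.0, 0.0};
    if (samplesMs.empty()) return s;
    s.minMs = *std::min_element(samplesMs.begin(), samplesMs.end());
    s.maxMs = *std::max_element(samplesMs.begin(), samplesMs.end());
    s.meanMs = std::accumulate(samplesMs.begin(), samplesMs.end(), 0.0) /
               static_cast<double>(samplesMs.size());
    if (s.minMs > 0.0) s.spreadPct = (s.maxMs - s.minMs) / s.minMs * 100.0;
    return s;
}

// Call `work` `runs` times and record each wall-clock duration.
// Assumption: `work` blocks until its GPU work is complete.
template <typename Fn>
std::vector<double> timeRuns(Fn work, std::size_t runs) {
    std::vector<double> samples;
    samples.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return samples;
}
```

Here `work` would be a lambda wrapping the detectMultiScale call on the same picture, run for something like 100 iterations. Feeding the samples to summarize gives the min, max and spread in one place, which makes the Jetson and desktop runs easier to compare.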

The GPU in Tegra K1 is almost identical to a desktop Kepler GPU, but the CPU and memory subsystem are completely different, so you should expect to see different performance behavior on desktop vs. mobile.

I’m not sure what is causing your performance issue specifically, but note that Tegra K1 has many levels of speed, voltage and temperature throttling, so it is common for measured speeds on Tegra to vary over time, even if you do the exact same operation continuously. There are various ways to make the timing more consistent on Tegra K1 (see



Thanks for your answer. One of the first things I created was a little script to unleash the full power of the K1, with the help of the Wiki in your post. I also have a little GUI monitoring clocks and temperatures. So the clocks are always at max.

With the Profiler application I monitored GPU activity and running tasks over time. The CUDA kernels used by OpenCV (for example compute_hists_kernel_many_blocks or normalize_hists_kernel_many_blocks …) always take the same time to compute. What differs between frames is the cudaDeviceSynchronize call. Its duration fluctuates dramatically (from 500 µs up to over 4 ms). With 15 multi-scale levels for the HOG, this seems to be the problem.
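
As a rough sanity check on that suspicion, here is a minimal sketch that counts the pyramid levels for the settings in this thread and multiplies them by the measured synchronization times. The function names are hypothetical, the level enumeration is a generic sliding-window pyramid rather than a copy of the OpenCV source, and one blocking synchronization per level is an assumption (the real count per level may well be higher).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Scales visited by a generic multi-scale sliding-window detector:
// start at 1.0 and multiply by `scaleFactor` until the shrunken image
// no longer fits the detection window or `maxLevels` is reached.
inline std::vector<double> pyramidScales(int imgW, int imgH, int winW, int winH,
                                         double scaleFactor, int maxLevels) {
    std::vector<double> scales;
    double scale = 1.0;
    for (int i = 0; i < maxLevels; ++i) {
        if (std::lround(imgW / scale) < winW ||
            std::lround(imgH / scale) < winH) {
            break;  // window no longer fits into the resized image
        }
        scales.push_back(scale);
        if (scaleFactor <= 1.0) break;  // no real pyramid, single level only
        scale *= scaleFactor;
    }
    return scales;
}

// Time spent waiting in synchronization for one frame.
// Assumption: `syncsPerLevel` blocking syncs per pyramid level,
// each taking `msPerSync` milliseconds.
inline double syncBudgetMs(std::size_t levels, int syncsPerLevel,
                           double msPerSync) {
    return static_cast<double>(levels) * syncsPerLevel * msPerSync;
}
```

With the values from this thread, pyramidScales(1024, 414, 48, 96, 1.10, 15) yields 15 levels. Under the one-sync-per-level assumption, syncBudgetMs(15, 1, 0.5) gives 7.5 ms and syncBudgetMs(15, 1, 4.0) gives 60 ms per frame, which is in the same range as the jump from 100 ms to 160 ms.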

The OpenCV CUDA HOG is written very generically. I think 1 SM (with 192 cores) is not the preferred platform for it. I'm not sure if it's worth the effort to optimize the source code specifically for the K1. Tegra X1 with its 2 SMs (128 cores per SM) can handle this better. I hope there will be a Jetson board with Tegra X1 soon!