OpenCV-Cuda functions running far slower than expected on Jetson Xavier NX

felix29 · August 10, 2022, 4:39pm

Hi,
I just transferred the code I have been working on from my laptop to the Jetson Xavier NX dev kit (in 20W 6 cores mode).

I was expecting some slowdown, on both CPU and GPU code, but no more than a factor of 2-3 at worst.

On the sections of the code that use the GPU (CUDA-OpenCV functions), I noticed a far bigger slowdown than expected (factors of roughly 5 to 14), which I can’t explain.

Do you have any idea why GPU code is so slow on the Jetson?

The timings:

function                                                 avg time on laptop   avg time on Jetson   ratio
cv::cuda::remap (undistortion and stereo rectification)  1.14 ms              6.39 ms              5.6
optical flow (no up/downloads)                           0.66 ms              6.72 ms              10.2
some uploads and downloads                               0.03 ms              0.42 ms              14.0
cv::cuda::goodFeaturesToMatch                            2.23 ms              11.03 ms             4.9

The GPU configurations:

GPU comparison                                  PC                      Jetson
CUDA Driver Version / Runtime Version           11.7 / 11.7             11.4 / 11.4
CUDA Capability Major/Minor version number      7.5                     7.2
Total amount of global memory                   3912 MB                 14908 MB
Number of CUDA Cores                            1024                    384
GPU Max Clock rate                              1.25 GHz                1.1 GHz
Memory Clock rate                               3501 MHz                1109 MHz
Memory Bus Width                                128-bit                 256-bit
L2 Cache Size                                   1048576 bytes           524288 bytes
Maximum Texture Dimension Size (x,y,z)          identical
Maximum Layered 1D Texture Size, (num) layers   identical
Maximum Layered 2D Texture Size, (num) layers   identical
Total amount of constant memory                 identical
Total amount of shared memory per block         identical
Total shared memory per multiprocessor          65536 bytes             98304 bytes
Total number of registers available per block   identical
Warp size                                       identical
Maximum number of threads per multiprocessor    1024                    2048
Maximum number of threads per block             identical
Max dimension size of a thread block (x,y,z)    identical
Max dimension size of a grid size (x,y,z)       identical
Maximum memory pitch                            identical
Texture alignment                               identical
Concurrent copy and kernel execution            Yes, 3 copy engine(s)   Yes, 1 copy engine(s)

So if the speed is compute-limited, I would expect a slowdown of at most a factor of about 3 (about 3 times fewer cores, roughly 10% slower GPU clock).
If the speed is limited by memory access time, then I would expect a slowdown between 1.5 and 3 (maybe 6):

- the memory clock is 3 times slower
- but the bus is 2 times wider (so up to 2x faster) → I suppose about 1.5 times slower when data is contiguous, 3 times slower for random access
- the cache is 2 times smaller, so maybe another factor of 2

So I would suppose at most a factor of 6 slowdown for memory access (probably closer to 3).

And I suppose that the computation slowdown and the memory slowdown are not cumulative, so a maximum overall slowdown of a factor of 3 (worst case 6).
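The back-of-the-envelope estimate above can be written out as a short sketch. It only uses the figures from the GPU comparison table. The throughput proxies (cores × GPU clock, memory clock × bus width) and all names in the code are assumptions made for illustration; they ignore cache size, memory type, and any other factor.

```python
# Figures copied from the GPU comparison table in this thread.
laptop = {"cores": 1024, "gpu_clock_ghz": 1.25, "mem_clock_mhz": 3501, "bus_width_bits": 128}
jetson = {"cores": 384, "gpu_clock_ghz": 1.1, "mem_clock_mhz": 1109, "bus_width_bits": 256}


def compute_proxy(gpu):
    """Crude compute throughput proxy (assumption): cores x GPU clock."""
    return gpu["cores"] * gpu["gpu_clock_ghz"]


def memory_proxy(gpu):
    """Crude bandwidth proxy (assumption): memory clock x bus width."""
    return gpu["mem_clock_mhz"] * gpu["bus_width_bits"]


compute_ratio = compute_proxy(laptop) / compute_proxy(jetson)
memory_ratio = memory_proxy(laptop) / memory_proxy(jetson)

# If the two limits are not cumulative, the larger one dominates.
expected_slowdown = max(compute_ratio, memory_ratio)

print(f"compute-bound slowdown estimate:   {compute_ratio:.2f}x")
print(f"bandwidth-bound slowdown estimate: {memory_ratio:.2f}x")
print(f"expected slowdown (max of both):   {expected_slowdown:.2f}x")
```

With these proxies the compute-bound estimate is about 3.0x and the contiguous-access bandwidth estimate about 1.6x, which matches the reasoning above.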

So how is it possible that I get a slowdown of a factor of 10.2 for the optical flow function?

Thanks a lot in advance

AastaLLL · August 11, 2022, 2:29am

Hi,

Would you mind testing this in MaxN mode instead?
Also, you could try our VPI library for better performance:
https://docs.nvidia.com/vpi/

Thanks.

felix29 · August 11, 2022, 8:07am

Hi,
For the MaxN mode, how do I access it on a Jetson Xavier NX? I see only the following modes :
0: 15W 2 cores
1: 15W 4 cores
2: 15W 6 cores
3: 10W 2 cores
4: 10W 4 cores
5: 10W DESKTOP
6: 20W 2 cores
7: 20W 4 cores
8: 20W 6 cores

So how do I get MaxN mode (this post even suggests there is no MaxN mode on the Xavier NX: “Jetson NX Power Modes - no MAXN for max. Performance?”)? Or else, how do I get full GPU power?

As for the VPI library, we are considering switching to it later, but for now using “only” CUDA acceleration lets us develop directly on a (far more powerful) laptop, which is easier (there is still plenty of development to do).

Thanks a lot in advance

felix29 · August 11, 2022, 10:19am

I dug a bit into the power modes:

- Running the “Jetson Power GUI”, I realized that the GPU is far from running at 1.1 GHz as advertised; it runs between 300 and 600 MHz instead (the maximum I observed, for short periods, was 800 MHz).
- I checked that nothing justifies throttling (GPU temperature 41°C, total power 8 W, GPU load <60%).
- In nvpmodel.conf, for my power profile, GPU MIN_FREQ is 0 and GPU MAX_FREQ is 1.1 GHz.
- I tried creating a custom power profile, setting GPU MIN_FREQ = GPU MAX_FREQ = 1.1 GHz, but the GPU frequency is still around 500 MHz!

So how can I force GPU to run at maximal frequency?

Thanks a lot in advance

EDIT: I found how to keep the GPU running at maximum speed all the time: you just have to run the jetson_clocks command (it is still strange that without it the GPU MIN_FREQ gets ignored).

The new timings (with jetson_clocks):

function                                                 avg time on laptop   avg time on Jetson   ratio
cv::cuda::remap (undistortion and stereo rectification)  1.14 ms              4.22 ms              3.7
optical flow (no up/downloads)                           0.66 ms              3.52 ms              5.3
some uploads and downloads                               0.03 ms              0.31 ms              10.3
cv::cuda::goodFeaturesToMatch                            2.23 ms              6.5 ms               2.9
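To compare the two sets of measurements side by side, here is a minimal sketch that recomputes the Jetson/laptop ratios and the gain from jetson_clocks. The numbers are copied from the two timing tables in this thread; the names `timings_ms` and `summarize` are made up for the example.

```python
# Average times in ms from the two timing tables in this thread:
# (laptop, Jetson before jetson_clocks, Jetson after jetson_clocks)
timings_ms = {
    "cv::cuda::remap": (1.14, 6.39, 4.22),
    "optical flow": (0.66, 6.72, 3.52),
    "uploads/downloads": (0.03, 0.42, 0.31),
    "cv::cuda::goodFeaturesToMatch": (2.23, 11.03, 6.5),
}


def summarize(timings):
    """Return slowdown ratios versus the laptop and the jetson_clocks gain."""
    rows = {}
    for name, (laptop, before, after) in timings.items():
        rows[name] = {
            "ratio_before": before / laptop,  # Jetson vs laptop, default clocks
            "ratio_after": after / laptop,    # Jetson vs laptop, jetson_clocks
            "gain": before / after,           # speed-up from jetson_clocks
        }
    return rows


summary = summarize(timings_ms)
for name, row in summary.items():
    print(f"{name:30s} before {row['ratio_before']:5.1f}x  "
          f"after {row['ratio_after']:5.1f}x  gain {row['gain']:.2f}x")
```

On these figures the gain from jetson_clocks ranges from about 1.35x (uploads and downloads) to about 1.91x (optical flow).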

AastaLLL · August 12, 2022, 3:52am

Hi,

Sorry, there is no MaxN mode for the Xavier NX.
20W (id=6) should be the performance mode.
https://docs.nvidia.com/jetson/archives/l4t-archived/l4t-3271/index.html#page/Tegra%20Linux%20Driver%20Package%20Development%20Guide/power_management_jetson_xavier.html#wwpID0E0YO0HA

By default, Jetson uses dynamic frequency scaling.
So you will need to run jetson_clocks to pin the clocks at their maximum.

For example:

$ sudo nvpmodel -m 6
$ sudo jetson_clocks

It seems that the latest benchmark score is much better.
Does it meet your expectation?

Thanks.

felix29 · August 12, 2022, 7:08am

Hi,
Thanks,
nvpmodel -m 6 + jetson_clocks was what “solved” the slow GPU clock (except that I used the GUI to select the 20W mode).

The latest benchmark results are already far better. The fact that the times were divided by 1.4 to 1.9 seems to show that the limitation is computational power.
This is particularly clear for the optical flow, which went from 6.72 ms to 3.52 ms (i.e. divided by 1.91).
However, for optical flow, I’m still 5.3 times slower than on the laptop, while I have about 3 times fewer cores (384 instead of 1024) and about 10% less GPU clock (1.1 GHz instead of 1.25 GHz), so I would expect a difference of at most a factor of (1024/384)*(1.25/1.1)=3.03. So this function is still about 75% slower than expected (5.3/3.03 ≈ 1.75)! And I have no idea why that might be so.
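The same comparison can be applied to every measured function, not only the optical flow. This is a rough sketch that divides each measured Jetson/laptop ratio (with jetson_clocks) by the compute-bound estimate from the post above. It assumes that cores × clock is a fair model for all four operations, which is probably not true for the uploads and downloads.

```python
# Compute-bound estimate from the post: (1024/384) * (1.25/1.1)
expected = (1024 / 384) * (1.25 / 1.1)

# (laptop ms, Jetson ms with jetson_clocks), copied from the timing tables.
measured_ms = {
    "cv::cuda::remap": (1.14, 4.22),
    "optical flow": (0.66, 3.52),
    "uploads/downloads": (0.03, 0.31),
    "cv::cuda::goodFeaturesToMatch": (2.23, 6.5),
}

# excess > 1 means slower than the compute-bound estimate predicts.
excess = {
    name: (jetson / laptop) / expected
    for name, (laptop, jetson) in measured_ms.items()
}

for name, value in excess.items():
    print(f"{name:30s} {value:.2f}x the expected slowdown")
```

By this measure the optical flow is about 1.76x slower than the estimate, while cv::cuda::goodFeaturesToMatch comes out slightly better than the estimate.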

AastaLLL · August 18, 2022, 5:42am

Hi,

Could you monitor the device with tegrastats as well?

$ sudo tegrastats

Please check the GPU utilization which is represented as GR3D_FREQ xx%@1109.
Thanks.
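For logging the GPU utilization over a benchmark run, the GR3D_FREQ field could be extracted from each tegrastats line with a small parser. This is only a sketch: it assumes the `GR3D_FREQ xx%@1109` format quoted above, the sample line is hypothetical, and `parse_gr3d` is a name invented for this example. The number after `@` is presumably the GPU clock in MHz.

```python
import re

# Matches "GR3D_FREQ 57%@1109"; the "@<freq>" part is treated as optional.
GR3D_PATTERN = re.compile(r"GR3D_FREQ (\d+)%(?:@(\d+))?")


def parse_gr3d(line):
    """Return (load_percent, freq_or_None) from a tegrastats line, or None."""
    match = GR3D_PATTERN.search(line)
    if match is None:
        return None
    load = int(match.group(1))
    freq = int(match.group(2)) if match.group(2) else None
    return load, freq


# Hypothetical sample line, reduced to the field of interest.
sample_line = "GR3D_FREQ 57%@1109"
print(parse_gr3d(sample_line))
```

In practice the output of `sudo tegrastats` would be fed to `parse_gr3d` line by line while the benchmark is running.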
