AGX Xavier CPU performance degradation after updating from L4T 32.5 to L4T 35.5

Hello all,

We have noticed a performance degradation when moving from an L4T 32.5 installation (Ubuntu 18.04, kernel 4.9) to L4T 35.5 (Ubuntu 20.04, kernel 5.10) on our Jetson AGX Xavier based industrial PC. The issue is also reproducible on a freshly installed Jetson AGX Xavier devkit.

The scheduling of parallel execution tasks is noticeably worse on the newer release (2x to 7x slower, depending on the load factor), and since we run our devices at the edge of their execution performance, this significantly affects our processes.

To demonstrate the issue, we created a small example, openmp-test.cpp (741 Bytes), that links against the OpenMP library and spins up 100 threads performing basic copy operations. We compared its performance on the aforementioned installations flashed on the same devkit, and here are the results we got:

Ubuntu 18.04: 150ms
Ubuntu 20.04: 460ms

Note that these tests were made on the Xavier devkit with a fresh installation and no other “significant” load on the device.

In practice, on our industrial machine with our computationally demanding application running on the device, the numbers change to the following:

Ubuntu 18.04 (with running app): 150ms
Ubuntu 20.04 (without running app): 270ms
Ubuntu 20.04 (with running app): 850ms

So we can clearly see that this has something to do with the scheduling of CPU execution.

We tried some parameter changes (ulimits, sysctl modifications) to reduce this difference, but we were not able to find a solution to the problem.

This is why we are wondering whether this issue has been flagged before by other users, and whether you can point us to configuration changes that would bring us back towards the performance we had with L4T 32.

Note: the file can be compiled and linked against OpenMP with clang++ openmp-test.cpp -fopenmp.
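
For reference, the core of the test looks roughly like this (a simplified sketch; the attached openmp-test.cpp is the authoritative version, and the timing/output code shown here is approximate):

// openmp-test.cpp (simplified sketch): 100 OpenMP threads, each allocating and
// initializing its own vectors and then copying input into output.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
  size_t count = 0;
  auto start = std::chrono::steady_clock::now();
#pragma omp parallel num_threads(100)
  {
    std::vector<size_t> input;
    input.assign(100000, 200);
    std::vector<size_t> output;
    output.assign(100000, 3);
#pragma omp for schedule(static) reduction(+ : count)
    for (size_t i = 0; i < input.size(); i++) {
      output[i] = input[i];
      ++count;
    }
  }
  auto end = std::chrono::steady_clock::now();
  auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
  std::printf("Time taken by function: %lld microseconds (count = %zu)\n",
              (long long)us, count);
  return 0;
}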

Thank you for your support

Hi,

Have you maximized the device performance before testing?

There is a known performance drop from kernel 4.9 to 5.10 due to the security hardening.
But the ratio should be around 1.x rather than 2x-7x.

Thanks.

Hello AastaLLL,

If by that you mean setting nvpmodel to 0, then yes, we did that on all devices before testing:

sudo nvpmodel  -q
NV Power Mode: MAXN
0

Hi,

Have you also run jetson_clocks to fix the clocks to the maximum?
We are going to reproduce this in our environment and will provide more info to you later.

Thanks.

Hi,

Confirmed that we can reproduce this issue locally.
Our internal team is actively working on this.

Will let you know once we get any feedback.
Thanks.

Hi,

Memory-management performance in multi-threaded workloads is known to be bounded by acquisition of the mmap_sem reader/writer semaphore [1][2], so we do not recommend using this scenario to validate memory bandwidth on the platform. Additionally, we checked multi-process memory bandwidth with lmbench and found no performance degradation. As a result, this should not be a platform issue.
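
For illustration, the kind of pattern that serializes on mmap_sem looks like the following (a generic sketch, not taken from your attached test; the sizes and thread count are arbitrary):

// Each thread maps, first-touches, and unmaps its own anonymous buffer.
// mmap()/munmap() take mmap_sem (mmap_lock on 5.10) for writing, and every
// first-touch page fault takes it for reading, so with many threads these
// operations serialize on that lock instead of scaling with core count.
#include <cstring>
#include <thread>
#include <vector>
#include <sys/mman.h>

constexpr size_t kBytes = 100000 * sizeof(size_t);  // ~800 KB, roughly 200 pages

int main() {
  std::vector<std::thread> threads;
  for (int t = 0; t < 100; ++t) {
    threads.emplace_back([] {
      void* p = mmap(nullptr, kBytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (p == MAP_FAILED) return;
      std::memset(p, 0xC8, kBytes);  // first touch: one page fault per page
      munmap(p, kBytes);
    });
  }
  for (auto& th : threads) th.join();
  return 0;
}

This is roughly what each thread's vector::assign() does internally for allocations of this size: a fresh mapping followed by a first touch of every page.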

Besides, we found that the cost of the data-abort/page-fault handler el0_da() can be the hotspot that impacts performance. Since el0_da() was converted from assembly to C code [3] and the memory/lock subsystem became more complicated from kernel 4.9 to 5.10, its time consumption may have increased. When el0_da() gets called frequently, this increase is amplified.

It is hard for us to revert these changes from kernel 5.10 back to 4.9.
Instead, we suggest improving performance from userspace by reducing the potential page faults: for example, avoid initializing memory in different threads if possible. Taking your source as an example, we can make changes like the following. This reduces the number of page faults and leads to a significant improvement in execution time. The time spent in futex() also drops because the lock contention is reduced in this way.

+ std::vector<size_t> input;
+ input.assign(100000 * 100, 200);
+ std::vector<size_t> output;
+ output.assign(100000 * 100, 3);
#pragma omp parallel num_threads(100)
  {
-    std::vector<size_t> input;
-    input.assign(100000, 200);
-    std::vector<size_t> output;
-    output.assign(100000, 3);
+    int t_num = omp_get_thread_num();
#pragma omp for schedule(static) reduction(+ : count)
-    for (size_t i = 0; i < input.size(); i++) {
-      output[i] = input[i];
+    for (size_t i = 0; i < 100000; i++) {
+      output[t_num * 100000 + i] = input[t_num * 100000 + i];
       ++count;
    }
  }

Before the change:

Time taken by function: 388029 microseconds
            39,500      page-faults                                                   ( +-  0.00% )
           0.42240 +- 0.00133 seconds time elapsed  ( +-  0.31% )

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 78.50    0.151802       30360         5           futex
 16.89    0.032653       16326         2           munmap
  2.41    0.004670          47        99           clone
  1.12    0.002167          21       101           mmap
  1.00    0.001942          19        99           mprotect
  0.05    0.000094          94         1           write
  0.01    0.000019           9         2           clock_gettime
  0.01    0.000017          17         1           ioctl
  0.01    0.000014           7         2           fstat
  0.00    0.000000           0         1           openat
  0.00    0.000000           0         1           close
  0.00    0.000000           0         2           getdents64
  0.00    0.000000           0         1           readlinkat
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         1           sched_getaffinity
  0.00    0.000000           0         2           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         4           brk
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           prlimit64
------ ----------- ----------- --------- --------- ----------------
100.00    0.193378                   330           total

After the change:

Time taken by function: 65014 microseconds

             1,037      page-faults                                                   ( +-  7.53% )
          0.082508 +- 0.000893 seconds time elapsed  ( +-  1.08% )
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 32.97    0.004715          47        99           clone
 20.54    0.002937         734         4           futex
 13.76    0.001968         984         2           munmap
 13.73    0.001964          19       101           mmap
 13.34    0.001907          19        99           mprotect
  1.63    0.000233         116         2           write
  0.87    0.000124         124         1           readlinkat
  0.64    0.000092          46         2           getdents64
  0.47    0.000067          16         4           brk
  0.43    0.000062          62         1           openat
  0.27    0.000039          19         2           fstat
  0.20    0.000028          28         1           ioctl
  0.19    0.000027          13         2           clock_gettime
  0.19    0.000027          13         2           rt_sigaction
  0.15    0.000021          21         1           close
  0.12    0.000017          17         1           uname
  0.12    0.000017          17         1           prlimit64
  0.11    0.000016          16         1           sched_getaffinity
  0.09    0.000013          13         1           set_tid_address
  0.09    0.000013          13         1           set_robust_list
  0.09    0.000013          13         1           rt_sigprocmask
  0.00    0.000000           0         1           execve
------ ----------- ----------- --------- --------- ----------------
100.00    0.014300                   330           total
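
For convenience, the modified test as a self-contained program might look like the following (a sketch assembled from the diff above; the includes, timing, and output scaffolding are assumptions):

// Modified openmp-test.cpp (sketch): the vectors are allocated and initialized
// once on the main thread, so the worker threads no longer trigger per-thread
// mmap()/page-fault storms inside the parallel region.
#include <chrono>
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
  size_t count = 0;

  // One large allocation per buffer, touched (initialized) before the parallel region.
  std::vector<size_t> input;
  input.assign(100000 * 100, 200);
  std::vector<size_t> output;
  output.assign(100000 * 100, 3);

  auto start = std::chrono::steady_clock::now();
#pragma omp parallel num_threads(100)
  {
    int t_num = omp_get_thread_num();
#pragma omp for schedule(static) reduction(+ : count)
    for (size_t i = 0; i < 100000; i++) {
      // Threads write to disjoint regions of the shared buffers, so there is no overlap.
      output[t_num * 100000 + i] = input[t_num * 100000 + i];
      ++count;
    }
  }
  auto end = std::chrono::steady_clock::now();
  auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
  std::printf("Time taken by function: %lld microseconds (count = %zu)\n",
              (long long)us, count);
  return 0;
}

Allocating and initializing the buffers once on the main thread means the pages are already faulted in before the parallel region starts, so the worker threads largely avoid el0_da() and the futex() contention that comes with concurrent fault handling.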

[1] https://lwn.net/Articles/730531/
[2] The LRU lock and mmap_sem (LWN.net)
[3] [PATCH v2 4/7] arm64: entry: convert el1_sync to C (James Morse)

Thanks.
