Severe performance regression on L4T-32

Hi. I’m observing a severe performance regression on L4T-32 compared to L4T-28.

We’re observing both higher latencies and lower throughput in our camera and AI pipeline. The regressions appear to be memory-related and are reproducible using the open-source stress-ng tool (see the stress-ng repository on GitHub).

Test 1:

Instructions:

stress-ng --mq 0 -t 60s --times --metrics

On L4T-28 running Linux 4.4:

stress-ng: info:  [31158] setting to a 60 second run per stressor
stress-ng: info:  [31158] dispatching hogs: 6 mq
stress-ng: metrc: [31158] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [31158]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [31158] mq             50245838     60.00     46.58    284.40    837410.30      151809.29        91.94          2148
stress-ng: info:  [31158] for a 60.01s run time:
stress-ng: info:  [31158]     360.05s available CPU time
stress-ng: info:  [31158]      46.57s user time   ( 12.93%)
stress-ng: info:  [31158]     284.40s system time ( 78.99%)
stress-ng: info:  [31158]     330.97s total time  ( 91.92%)
stress-ng: info:  [31158] load average: 8.39 6.40 4.29
stress-ng: info:  [31158] passed: 6: mq (6)
stress-ng: info:  [31158] failed: 0
stress-ng: info:  [31158] skipped: 0
stress-ng: info:  [31158] successful run completed in 60.01s (1 min, 0.01 secs)

On L4T-32 running Linux 4.9:

stress-ng: info:  [9546] setting to a 60 second run per stressor
stress-ng: info:  [9546] dispatching hogs: 6 mq
stress-ng: metrc: [9546] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [9546]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [9546] mq             25879366     60.00     45.65    292.86    431303.25       76452.17        94.02          4540
stress-ng: info:  [9546] for a 60.01s run time:
stress-ng: info:  [9546]     360.05s available CPU time
stress-ng: info:  [9546]      45.65s user time   ( 12.68%)
stress-ng: info:  [9546]     292.85s system time ( 81.34%)
stress-ng: info:  [9546]     338.50s total time  ( 94.01%)
stress-ng: info:  [9546] load average: 6.46 3.51 2.05
stress-ng: info:  [9546] passed: 6: mq (6)
stress-ng: info:  [9546] failed: 0
stress-ng: info:  [9546] skipped: 0
stress-ng: info:  [9546] successful run completed in 60.01s (1 min, 0.01 secs)

Test 2:

Instructions:

stress-ng --matrix 0 --mq 0 -t 60s --times --metrics

On L4T-28 running Linux 4.4:

stress-ng: info:  [30925] setting to a 60 second run per stressor
stress-ng: info:  [30925] dispatching hogs: 6 matrix, 6 mq
stress-ng: metrc: [30925] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [30925]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [30925] matrix           107798     60.00    155.03      3.75      1796.61         678.93        44.10          2440
stress-ng: metrc: [30925] mq             29285865     60.00     25.57    173.98    488091.91      146761.00        55.43          2136
stress-ng: metrc: [30925] miscellaneous metrics:
stress-ng: metrc: [30925] matrix             19857.62 add matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix             63710.40 copy matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix              9410.29 div matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix             10761.43 frobenius matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix             23314.67 hadamard matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix             23050.92 identity matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix             14549.59 mean matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix             24576.44 mult matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix             41438.12 negate matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix                36.58 prod matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix             22669.26 sub matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix                53.47 square matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix              6864.56 trans matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix             47455.58 zero matrix ops per sec (geometric mean of 6 instances)
stress-ng: info:  [30925] for a 60.04s run time:
stress-ng: info:  [30925]     360.24s available CPU time
stress-ng: info:  [30925]     585.93s user time   (162.65%)
stress-ng: info:  [30925]     462.16s system time (128.29%)
stress-ng: info:  [30925]    1048.09s total time  (290.94%)
stress-ng: info:  [30925] load average: 12.04 7.79 4.93
stress-ng: info:  [30925] passed: 12: matrix (6) mq (6)
stress-ng: info:  [30925] failed: 0
stress-ng: info:  [30925] skipped: 0
stress-ng: info:  [30925] successful run completed in 60.04s (1 min, 0.04 secs)

On L4T-32 running Linux 4.9:

stress-ng: info:  [9269] setting to a 60 second run per stressor
stress-ng: info:  [9269] dispatching hogs: 6 matrix, 6 mq
stress-ng: metrc: [9269] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [9269]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [9269] matrix            90698     60.00    186.63      2.73      1511.61         478.97        52.60          6600
stress-ng: metrc: [9269] mq              6542156     60.00     22.94    144.27    109034.80       39126.79        46.45          4332
stress-ng: metrc: [9269] miscellaneous metrics:
stress-ng: metrc: [9269] matrix             19944.22 add matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix             42091.78 copy matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix              7917.28 div matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix             13342.23 frobenius matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix             28644.34 hadamard matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix             25337.66 identity matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix             26826.86 mean matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix             40908.06 mult matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix             45358.49 negate matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix                35.29 prod matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix             21534.56 sub matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix                37.78 square matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix              4756.33 trans matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix            131500.26 zero matrix ops per sec (geometric mean of 6 instances)
stress-ng: info:  [9269] for a 60.03s run time:
stress-ng: info:  [9269]     360.20s available CPU time
stress-ng: info:  [9269]     612.75s user time   (170.12%)
stress-ng: info:  [9269]     439.94s system time (122.14%)
stress-ng: info:  [9269]    1052.69s total time  (292.25%)
stress-ng: info:  [9269] load average: 10.30 5.16 2.73
stress-ng: info:  [9269] passed: 12: matrix (6) mq (6)
stress-ng: info:  [9269] failed: 0
stress-ng: info:  [9269] skipped: 0
stress-ng: info:  [9269] successful run completed in 60.03s (1 min, 0.03 secs)

The “bogo ops” on L4T-32 are significantly lower compared to those on L4T-28.

I understand that stress-ng is a synthetic benchmark, but it aligns well with our real-world use case.

We’re seeing the regressions across several L4T-32 releases, but the logs above are specifically from L4T 32.7.4.
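For reference, the size of the drop can be quantified directly from the mq bogo-ops figures in the logs above (a quick Python sketch; the numbers are copied verbatim from the two runs):

```python
# mq bogo-ops figures copied from the stress-ng logs above.
runs = {
    "mq only (Test 1)":     {"L4T-28": 50245838, "L4T-32": 25879366},
    "matrix + mq (Test 2)": {"L4T-28": 29285865, "L4T-32": 6542156},
}

for name, r in runs.items():
    drop = 100.0 * (1 - r["L4T-32"] / r["L4T-28"])
    print(f"{name}: {drop:.1f}% fewer mq bogo ops on L4T-32")
# Test 1: ~48.5% drop; Test 2: ~77.7% drop.
```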


Is there any advice on how to improve performance for these tests?

Thanks,
Robb

Have you tried boosting the system to maximum performance for comparison?

sudo nvpmodel -m 0
sudo jetson_clocks
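If useful, the effect of jetson_clocks can be sanity-checked through the standard cpufreq sysfs interface (a minimal sketch; the paths assume the generic Linux cpufreq layout and the function returns an empty dict where cpufreq is not exposed):

```python
from pathlib import Path

def cpu_freqs_khz():
    """Return {cpu name: current frequency in kHz} from cpufreq sysfs.

    Reads the generic scaling_cur_freq files; returns {} if the
    cpufreq interface is not available (e.g. inside a container).
    """
    freqs = {}
    for p in sorted(Path("/sys/devices/system/cpu").glob(
            "cpu[0-9]*/cpufreq/scaling_cur_freq")):
        freqs[p.parent.parent.name] = int(p.read_text())
    return freqs

print(cpu_freqs_khz())
```

Running this before and after `sudo jetson_clocks` should show all cores pinned at their maximum frequency.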

Hi. Yes, we have tried those settings. We are operating in Max-N mode with the clocks maxed.

Please try disabling the configs below:
CONFIG_MITIGATE_SPECTRE_BRANCH_HISTORY=n
CONFIG_HARDEN_BRANCH_PREDICTOR=n
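Before rebuilding the kernel, it may be worth confirming which mitigations the running kernel actually reports as active. A small sketch reading the standard sysfs vulnerabilities directory (returns an empty dict on kernels or environments that do not expose it):

```python
from pathlib import Path

def read_mitigations(base="/sys/devices/system/cpu/vulnerabilities"):
    """Return {vulnerability: kernel-reported status}, or {} if unsupported."""
    d = Path(base)
    if not d.is_dir():
        return {}
    return {p.name: p.read_text().strip() for p in sorted(d.iterdir())}

for name, status in read_mitigations().items():
    print(f"{name}: {status}")
```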

The MITIGATE_SPECTRE_BRANCH_HISTORY configuration is enabled to address the Spectre branch-history (Spectre-BHB) security vulnerability. Vulnerabilities in this class exploit speculative execution and caches in modern processors, which can leak sensitive data.

If CPU performance is a major concern and has higher priority than security in your deployment, disabling these mitigations is an option to consider.
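For context on why these mitigations hit the mq stressor so hard: the workload is dominated by mq_send/mq_receive syscalls, so any per-syscall or per-context-switch mitigation cost compounds rapidly. A rough analogue (not stress-ng itself; a minimal Python sketch using a socketpair ping-pong between two processes to count syscall-heavy round trips):

```python
import os
import socket
import time

def pingpong_ops_per_sec(duration=0.2):
    """Count send/recv round trips over a socketpair for `duration` seconds.

    Each round trip costs several syscalls and a context switch, loosely
    analogous to the mq_send/mq_receive loop driving stress-ng's --mq stressor.
    """
    a, b = socket.socketpair()
    pid = os.fork()
    if pid == 0:          # child: echo each byte back until EOF
        a.close()
        try:
            while True:
                data = b.recv(1)
                if not data:
                    break
                b.sendall(data)
        finally:
            os._exit(0)
    b.close()
    ops = 0
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        a.sendall(b"x")   # syscall: write to child
        a.recv(1)         # syscall: block until child echoes
        ops += 1
    a.close()             # EOF tells the child to exit
    os.waitpid(pid, 0)
    return ops / duration

print(f"{pingpong_ops_per_sec():.0f} round trips/s")
```

Comparing this figure with the mitigations on and off would isolate the per-syscall overhead from other kernel changes between 4.4 and 4.9.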
