Hi. We're seeing a severe performance regression on L4T-32 compared to L4T-28: higher latencies and lower throughput in our camera and AI pipeline. The regressions appear to be memory-related and are reproducible with the open-source stress-ng tool (https://github.com/ColinIanKing/stress-ng).
Test 1:
Instructions:
stress-ng --mq 0 -t 60s --times --metrics
On L4T-28 running Linux 4.4:
stress-ng: info: [31158] setting to a 60 second run per stressor
stress-ng: info: [31158] dispatching hogs: 6 mq
stress-ng: metrc: [31158] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [31158]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)  instance (%)          (KB)
stress-ng: metrc: [31158] mq             50245838     60.00     46.58    284.40     837410.30      151809.29        91.94          2148
stress-ng: info: [31158] for a 60.01s run time:
stress-ng: info: [31158] 360.05s available CPU time
stress-ng: info: [31158] 46.57s user time ( 12.93%)
stress-ng: info: [31158] 284.40s system time ( 78.99%)
stress-ng: info: [31158] 330.97s total time ( 91.92%)
stress-ng: info: [31158] load average: 8.39 6.40 4.29
stress-ng: info: [31158] passed: 6: mq (6)
stress-ng: info: [31158] failed: 0
stress-ng: info: [31158] skipped: 0
stress-ng: info: [31158] successful run completed in 60.01s (1 min, 0.01 secs)
On L4T-32 running Linux 4.9:
stress-ng: info: [9546] setting to a 60 second run per stressor
stress-ng: info: [9546] dispatching hogs: 6 mq
stress-ng: metrc: [9546] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [9546]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)  instance (%)          (KB)
stress-ng: metrc: [9546] mq             25879366     60.00     45.65    292.86     431303.25       76452.17        94.02          4540
stress-ng: info: [9546] for a 60.01s run time:
stress-ng: info: [9546] 360.05s available CPU time
stress-ng: info: [9546] 45.65s user time ( 12.68%)
stress-ng: info: [9546] 292.85s system time ( 81.34%)
stress-ng: info: [9546] 338.50s total time ( 94.01%)
stress-ng: info: [9546] load average: 6.46 3.51 2.05
stress-ng: info: [9546] passed: 6: mq (6)
stress-ng: info: [9546] failed: 0
stress-ng: info: [9546] skipped: 0
stress-ng: info: [9546] successful run completed in 60.01s (1 min, 0.01 secs)
Test 2:
Instructions:
stress-ng --matrix 0 --mq 0 -t 60s --times --metrics
On L4T-28 running Linux 4.4:
stress-ng: info: [30925] setting to a 60 second run per stressor
stress-ng: info: [30925] dispatching hogs: 6 matrix, 6 mq
stress-ng: metrc: [30925] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [30925]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)  instance (%)          (KB)
stress-ng: metrc: [30925] matrix           107798     60.00    155.03      3.75       1796.61         678.93        44.10          2440
stress-ng: metrc: [30925] mq             29285865     60.00     25.57    173.98     488091.91      146761.00        55.43          2136
stress-ng: metrc: [30925] miscellaneous metrics:
stress-ng: metrc: [30925] matrix 19857.62 add matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix 63710.40 copy matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix 9410.29 div matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix 10761.43 frobenius matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix 23314.67 hadamard matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix 23050.92 identity matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix 14549.59 mean matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix 24576.44 mult matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix 41438.12 negate matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix 36.58 prod matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix 22669.26 sub matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix 53.47 square matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix 6864.56 trans matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [30925] matrix 47455.58 zero matrix ops per sec (geometric mean of 6 instances)
stress-ng: info: [30925] for a 60.04s run time:
stress-ng: info: [30925] 360.24s available CPU time
stress-ng: info: [30925] 585.93s user time (162.65%)
stress-ng: info: [30925] 462.16s system time (128.29%)
stress-ng: info: [30925] 1048.09s total time (290.94%)
stress-ng: info: [30925] load average: 12.04 7.79 4.93
stress-ng: info: [30925] passed: 12: matrix (6) mq (6)
stress-ng: info: [30925] failed: 0
stress-ng: info: [30925] skipped: 0
stress-ng: info: [30925] successful run completed in 60.04s (1 min, 0.04 secs)
On L4T-32 running Linux 4.9:
stress-ng: info: [9269] setting to a 60 second run per stressor
stress-ng: info: [9269] dispatching hogs: 6 matrix, 6 mq
stress-ng: metrc: [9269] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [9269]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)  instance (%)          (KB)
stress-ng: metrc: [9269] matrix            90698     60.00    186.63      2.73       1511.61         478.97        52.60          6600
stress-ng: metrc: [9269] mq              6542156     60.00     22.94    144.27     109034.80       39126.79        46.45          4332
stress-ng: metrc: [9269] miscellaneous metrics:
stress-ng: metrc: [9269] matrix 19944.22 add matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix 42091.78 copy matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix 7917.28 div matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix 13342.23 frobenius matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix 28644.34 hadamard matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix 25337.66 identity matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix 26826.86 mean matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix 40908.06 mult matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix 45358.49 negate matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix 35.29 prod matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix 21534.56 sub matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix 37.78 square matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix 4756.33 trans matrix ops per sec (geometric mean of 6 instances)
stress-ng: metrc: [9269] matrix 131500.26 zero matrix ops per sec (geometric mean of 6 instances)
stress-ng: info: [9269] for a 60.03s run time:
stress-ng: info: [9269] 360.20s available CPU time
stress-ng: info: [9269] 612.75s user time (170.12%)
stress-ng: info: [9269] 439.94s system time (122.14%)
stress-ng: info: [9269] 1052.69s total time (292.25%)
stress-ng: info: [9269] load average: 10.30 5.16 2.73
stress-ng: info: [9269] passed: 12: matrix (6) mq (6)
stress-ng: info: [9269] failed: 0
stress-ng: info: [9269] skipped: 0
stress-ng: info: [9269] successful run completed in 60.03s (1 min, 0.03 secs)
The mq bogo ops on L4T-32 are significantly lower than on L4T-28: roughly 837k vs. 431k bogo ops/s (real time) in Test 1, and roughly 488k vs. 109k in Test 2. I understand that stress-ng is a synthetic benchmark, but it aligns well with our real-world use case.
We're seeing the regressions across several L4T-32 releases; the logs above are specifically from L4T 32.7.4.
Is there any advice on how to improve performance for these tests?
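In case it helps narrow things down, here is the checklist we use before each run to rule out DVFS/power-model differences between the two releases (a sketch; `nvpmodel`/`jetson_clocks` availability and the max-performance model number depend on the specific Jetson platform):

```shell
# Pre-test checklist: lock clocks so DVFS differences between L4T
# releases don't skew the comparison.
sudo nvpmodel -m 0      # select the max-performance power model
sudo jetson_clocks      # pin CPU/GPU/EMC clocks to their maximums

# Verify the CPU governor and current frequency before benchmarking:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
```

Both boards were configured this way for the logs above, so clocking alone does not seem to explain the gap.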
Thanks,
Robb