Performance difference between JetPack and TensorRT versions

Hi,

I get very different inference timings from trtexec on two Jetson Nano devices that run different versions of JetPack and TensorRT.

Device 1: Jetson Nano, JetPack 4.4, TensorRT 7.1.3
Device 2: Jetson Nano, JetPack 4.6, TensorRT 8.1.2

It seems there is a problem with device 2: for any ONNX model, trtexec on device 2 reports much higher latencies than device 1. The trtexec output for the same ONNX model is given below for both devices.

For Device 1:
[05/15/2023-10:23:54] [I] Host Latency
[05/15/2023-10:23:54] [I] min: 17.8914 ms (end to end 17.9038 ms)
[05/15/2023-10:23:54] [I] max: 36.7483 ms (end to end 38.3356 ms)
[05/15/2023-10:23:54] [I] mean: 24.061 ms (end to end 24.6089 ms)
[05/15/2023-10:23:54] [I] median: 19.6371 ms (end to end 20.1515 ms)
[05/15/2023-10:23:54] [I] percentile: 36.5361 ms at 99% (end to end 37.5812 ms at 99%)
[05/15/2023-10:23:54] [I] throughput: 40.6347 qps
[05/15/2023-10:23:54] [I] walltime: 3.02697 s
[05/15/2023-10:23:54] [I] Enqueue Time
[05/15/2023-10:23:54] [I] min: 6.39282 ms
[05/15/2023-10:23:54] [I] max: 11.272 ms
[05/15/2023-10:23:54] [I] median: 7.11914 ms
[05/15/2023-10:23:54] [I] GPU Compute
[05/15/2023-10:23:54] [I] min: 17.4719 ms
[05/15/2023-10:23:54] [I] max: 36.2471 ms
[05/15/2023-10:23:54] [I] mean: 23.6288 ms
[05/15/2023-10:23:54] [I] median: 19.182 ms
[05/15/2023-10:23:54] [I] percentile: 36.1172 ms at 99%
[05/15/2023-10:23:54] [I] total compute time: 2.90634 s

For Device 2:

[05/15/2023-12:41:32] [I] === Performance summary ===
[05/15/2023-12:41:32] [I] Throughput: 12.5568 qps
[05/15/2023-12:41:32] [I] Latency: min = 66.3296 ms, max = 198.232 ms, mean = 79.1126 ms, median = 67.4297 ms, percentile(99%) = 198.232 ms
[05/15/2023-12:41:32] [I] End-to-End Host Latency: min = 66.3831 ms, max = 210.906 ms, mean = 79.6359 ms, median = 67.4835 ms, percentile(99%) = 210.906 ms
[05/15/2023-12:41:32] [I] Enqueue Time: min = 7.81299 ms, max = 13.1021 ms, mean = 10.4438 ms, median = 10.4058 ms, percentile(99%) = 13.1021 ms
[05/15/2023-12:41:32] [I] H2D Latency: min = 3.33521 ms, max = 10.3575 ms, mean = 4.20191 ms, median = 3.6214 ms, percentile(99%) = 10.3575 ms
[05/15/2023-12:41:32] [I] GPU Compute Time: min = 61.6451 ms, max = 189.297 ms, mean = 73.8214 ms, median = 62.764 ms, percentile(99%) = 189.297 ms
[05/15/2023-12:41:32] [I] D2H Latency: min = 0.906616 ms, max = 1.24756 ms, mean = 1.08936 ms, median = 1.09595 ms, percentile(99%) = 1.24756 ms
[05/15/2023-12:41:32] [I] Total Host Walltime: 3.18552 s
[05/15/2023-12:41:32] [I] Total GPU Compute Time: 2.95285 s
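As a sanity check when comparing the two logs, throughput is roughly the reciprocal of the mean end-to-end latency, so the roughly 3x latency gap and the roughly 3x throughput gap are the same symptom:

```python
# Sanity check: reported throughput (qps) should be roughly the reciprocal
# of the mean end-to-end host latency. Values are taken from the logs above.
device_means_ms = {
    "Device 1": 24.6089,  # mean end-to-end latency, Device 1
    "Device 2": 79.6359,  # mean end-to-end latency, Device 2
}

for name, mean_ms in device_means_ms.items():
    qps = 1000.0 / mean_ms
    print(f"{name}: {qps:.1f} qps")
```

This gives about 40.6 qps and 12.6 qps, matching the reported 40.6347 qps and 12.5568 qps.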

Both devices are in MAXN mode, and the power supplies are believed to be adequate.

What could be the reason for this performance difference, and how can I troubleshoot it?

Thanks.

Hi,

Just to confirm first:
Do you use the same model for testing on both devices?

Also, have you fixed the device clocks to the maximum?

$ sudo jetson_clocks

Thanks.

Yes, I use the same model for testing.

And the results didn't change after running jetson_clocks.

Thanks.

Edit:
When I check jetson_clocks with the --show argument, the EMC frequency seems to stay at its minimum, which is 204 MHz. Is this a problem? The console output is given below.

ubuntu@ubuntu:~$ sudo jetson_clocks --show
SOC family:tegra210 Machine:NVIDIA Jetson Nano Developer Kit
Online CPUs: 0-3
cpu0: Online=1 Governor=schedutil MinFreq=102000 MaxFreq=1479000 CurrentFreq=1479000 IdleStates: WFI=0 c7=0
cpu1: Online=1 Governor=schedutil MinFreq=102000 MaxFreq=1479000 CurrentFreq=1479000 IdleStates: WFI=0 c7=0
cpu2: Online=1 Governor=schedutil MinFreq=102000 MaxFreq=1479000 CurrentFreq=1479000 IdleStates: WFI=0 c7=0
cpu3: Online=1 Governor=schedutil MinFreq=102000 MaxFreq=1479000 CurrentFreq=1479000 IdleStates: WFI=0 c7=0
GPU MinFreq=76800000 MaxFreq=921600000 CurrentFreq=921600000
EMC MinFreq=204000000 MaxFreq=1600000000 CurrentFreq=204000000 FreqOverride=1
NV Power Mode: MAXN
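The stuck EMC clock is visible directly in this output, since CurrentFreq equals MinFreq while MaxFreq is much higher. A minimal sketch in Python that checks this (the EMC line is copied from the output above):

```python
import re

# EMC line copied from the `jetson_clocks --show` output above.
emc_line = "EMC MinFreq=204000000 MaxFreq=1600000000 CurrentFreq=204000000 FreqOverride=1"

# Parse key=value pairs such as MinFreq=204000000 into a dict.
fields = dict(re.findall(r"(\w+)=(\d+)", emc_line))
min_hz, max_hz, cur_hz = (int(fields[k]) for k in ("MinFreq", "MaxFreq", "CurrentFreq"))

if cur_hz == min_hz < max_hz:
    print("EMC is stuck at its minimum frequency:", cur_hz // 1_000_000, "MHz")
```

This prints that the EMC is stuck at 204 MHz, while the maximum is 1600 MHz.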

Hi,

The GPU clock is fixed to the maximum, so jetson_clocks should be okay.
Could you share the model with us so we can reproduce this in our environment?

Thanks.

Hi,

I will share a different model, but the results are very similar:

GPU mean: 22 ms (device 1)
GPU mean: 69 ms (device 2)

mobilenetv2-7.onnx (13.6 MB)

Hi,

Could you also share the tegrastats for both devices with us?
Also, have you been able to reproduce the same issue on a newer GPU architecture?

Thanks.

Hi,

We found the issue with device 2: the device was broken due to an incorrect DTB (device tree blob) file update.

Timings are all good after re-flashing the device.

Thank you.