Orin low performance on MobileNet-v1 SSD

I ran some benchmarks on the Orin dev kit with jetson_benchmarks (GitHub - NVIDIA-AI-IOT/jetson_benchmarks: Jetson Benchmark).
Power mode: MAXN
TensorRT 8.4 in JetPack 5
The throughput is less than 300 fps, whereas Xavier NX reaches 800 fps and AGX Xavier reaches 1,500 fps.
What went wrong? Please advise.

/usr/src/tensorrt/bin/trtexec --onnx=/home/andy/nvidia/jetson_benchmarks/models/ssd-mobilenet-v1-bs16.onnx --useSpinWait --useCudaGraph --int8 --workspace=4096 --avgRuns=100 --duration=180

[05/06/2022-20:57:24] [I] === Performance summary ===
[05/06/2022-20:57:24] [I] Throughput: 296.952 qps
[05/06/2022-20:57:24] [I] Latency: min = 4.01562 ms, max = 12.0008 ms, mean = 4.78139 ms, median = 4.70898 ms, percentile(99%) = 6.21875 ms
[05/06/2022-20:57:24] [I] Enqueue Time: min = 0 ms, max = 0.867188 ms, mean = 0.0241754 ms, median = 0.0200195 ms, percentile(99%) = 0.0664062 ms
[05/06/2022-20:57:24] [I] H2D Latency: min = 0.503906 ms, max = 3.08443 ms, mean = 1.1139 ms, median = 1.13281 ms, percentile(99%) = 1.1875 ms
[05/06/2022-20:57:24] [I] GPU Compute Time: min = 3.20312 ms, max = 10.6925 ms, mean = 3.36628 ms, median = 3.27734 ms, percentile(99%) = 4.72656 ms
[05/06/2022-20:57:24] [I] D2H Latency: min = 0.140625 ms, max = 0.869156 ms, mean = 0.300852 ms, median = 0.300781 ms, percentile(99%) = 0.3125 ms
[05/06/2022-20:57:24] [I] Total Host Walltime: 180.009 s
[05/06/2022-20:57:24] [I] Total GPU Compute Time: 179.941 s
[05/06/2022-20:57:24] [W] * GPU compute time is unstable, with coefficient of variance = 9.40032%.
[05/06/2022-20:57:24] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.


Have you maximized the device performance first?

$ sudo jetson_clocks

We are going to reproduce this issue internally
and will share our results with you later.



We tested the same command on Xavier and got 100 qps.
So Orin still delivers much better performance.

Could you share how you get the 1,500fps on the AGX Xavier?

@AastaLLL, I found the Xavier performance numbers at this link: Compare NVIDIA Jetson Xavier NX with Jetson TX2 Developer Kits - Latest Open Tech From Seeed


Please note that the throughput shown on that page takes the two extra DLAs into account.

Also, the unit in the table is frames per second, whereas the TensorRT qps figure counts inferences (queries) per second.
In this use case the model has batch size 16, which means each inference processes 16 images.
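
To make the unit difference concrete, here is a minimal Python sketch that reads the throughput line from the trtexec summary above and converts it to images per second. It assumes that every query processes a full batch of 16 images, as described in this reply and suggested by the "bs16" in the model file name. The function name images_per_second is hypothetical, chosen for illustration only. The result covers the GPU alone and says nothing about the DLA contribution.

```python
import re


def images_per_second(summary_line: str, batch_size: int) -> float:
    """Convert a trtexec 'Throughput: N qps' line to images per second.

    Assumption: each query (inference) processes exactly `batch_size` images.
    """
    match = re.search(r"Throughput:\s*([0-9.]+)\s*qps", summary_line)
    if match is None:
        raise ValueError("no 'Throughput: ... qps' entry found in line")
    return float(match.group(1)) * batch_size


# Throughput line copied from the Orin log in the first post.
line = "[05/06/2022-20:57:24] [I] Throughput: 296.952 qps"
print(f"{images_per_second(line, 16):.1f} images/s")  # prints "4751.2 images/s"
```

Under the same assumption, the 100 qps measured on Xavier corresponds to 1,600 images per second on the GPU alone.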


What do you mean by taking the GPU + 2 extra DLAs into account? I want to understand the scenario/scheduling under which you tested the networks:

1- The GPU and the 2 DLAs run the same network in separate executions at the same time, and you add up the throughput at the end?
2- The GPU and the 2 DLAs run the same network in separate executions at different time intervals, so that there is no contention, and you add up the throughput at the end?
3- Any other possible scenario?

Any clarification would be greatly appreciated.

Hi Splendor027,

Please open a new topic for your issue. Thanks.