Why does trtexec on Orin produce much better inference throughput from an ONNX model than from a saved engine?

Hi there,

I’m using an Orin 32GB eval system. I ran trtexec on the same model in two ways, but I see very different throughput.

First, I ran it from the ONNX model and let it save the engine. Then I ran it directly with the saved engine. Somehow, the second approach gives much worse throughput. Please advise. Thanks.

I’m attaching the details below.

1) Approach 1, from the ONNX model:
trtexec --onnx=c3dv2.3.k4.onnx --saveEngine=c3dv2.3.k4.engine --exportProfile=c3dv2.3.k4.json --int8 --useDLACore=0 --allowGPUFallback --useSpinWait --separateProfileRun > c3dv2.3.k4.log
Result:
[08/29/2022-14:28:54] [I] Average on 10 runs - GPU latency: 9.62085 ms - Host latency: 9.94692 ms (enqueue 9.52153 ms)
[08/29/2022-14:28:54] [I] Average on 10 runs - GPU latency: 9.69641 ms - Host latency: 10.0225 ms (enqueue 9.52104 ms)
[08/29/2022-14:28:54] [I] Average on 10 runs - GPU latency: 9.62671 ms - Host latency: 9.95383 ms (enqueue 9.59409 ms)
[08/29/2022-14:28:54] [I]
[08/29/2022-14:28:54] [I] === Performance summary ===
[08/29/2022-14:28:54] [I] Throughput: 103.578 qps
[08/29/2022-14:28:54] [I] Latency: min = 9.87085 ms, max = 11.1741 ms, mean = 9.94917 ms, median = 9.93164 ms, percentile(99%) = 10.1422 ms
[08/29/2022-14:28:54] [I] Enqueue Time: min = 9.43918 ms, max = 10.1167 ms, mean = 9.52211 ms, median = 9.51367 ms, percentile(99%) = 9.72949 ms
[08/29/2022-14:28:54] [I] H2D Latency: min = 0.0930176 ms, max = 0.106934 ms, mean = 0.0958078 ms, median = 0.0957031 ms, percentile(99%) = 0.0991211 ms
[08/29/2022-14:28:54] [I] GPU Compute Time: min = 9.56708 ms, max = 10.8484 ms, mean = 9.62289 ms, median = 9.60498 ms, percentile(99%) = 9.81604 ms
[08/29/2022-14:28:54] [I] D2H Latency: min = 0.197021 ms, max = 0.251831 ms, mean = 0.230476 ms, median = 0.230408 ms, percentile(99%) = 0.234192 ms
[08/29/2022-14:28:54] [I] Total Host Walltime: 3.02188 s
[08/29/2022-14:28:54] [I] Total GPU Compute Time: 3.01196 s

2) Approach 2, from the engine saved above:
trtexec --loadEngine=c3dv2.3.k4.engine --exportProfile=c3dv2.3.k4.json --int8 --useDLACore=0 --allowGPUFallback --useSpinWait --separateProfileRun
Result:
[08/29/2022-14:30:59] [I] Average on 10 runs - GPU latency: 14.1678 ms - Host latency: 14.6946 ms (enqueue 13.9329 ms)
[08/29/2022-14:30:59] [I] Average on 10 runs - GPU latency: 14.1273 ms - Host latency: 14.6502 ms (enqueue 13.8701 ms)
[08/29/2022-14:30:59] [I]
[08/29/2022-14:30:59] [I] === Performance summary ===
[08/29/2022-14:30:59] [I] Throughput: 74.0434 qps
[08/29/2022-14:30:59] [I] Latency: min = 13.72 ms, max = 15.6018 ms, mean = 13.9495 ms, median = 13.7749 ms, percentile(99%) = 15.5178 ms
[08/29/2022-14:30:59] [I] Enqueue Time: min = 4.10065 ms, max = 14.8457 ms, mean = 13.2267 ms, median = 13.1588 ms, percentile(99%) = 14.8369 ms
[08/29/2022-14:30:59] [I] H2D Latency: min = 0.112976 ms, max = 0.172302 ms, mean = 0.136312 ms, median = 0.132324 ms, percentile(99%) = 0.166992 ms
[08/29/2022-14:30:59] [I] GPU Compute Time: min = 13.2705 ms, max = 15.0791 ms, mean = 13.4866 ms, median = 13.3241 ms, percentile(99%) = 15.0662 ms
[08/29/2022-14:30:59] [I] D2H Latency: min = 0.312744 ms, max = 0.384277 ms, mean = 0.326589 ms, median = 0.317932 ms, percentile(99%) = 0.383545 ms
[08/29/2022-14:30:59] [I] Total Host Walltime: 2.24193 s
[08/29/2022-14:30:59] [I] Total GPU Compute Time: 2.23877 s

Hi,

Have you maximized the device performance first?
Orin uses dynamic frequency scaling by default, which can affect the performance of intensive tasks.

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

If you still observe similar behavior after fixing the clocks to the maximum, could you share the model with us so we can check?

Thanks.

Cool, thanks for the quick reply. With these two commands, the inference-only run now achieves the same result as the combo (build + inference) approach. The combo approach itself shows no improvement in throughput.

So it seems the combo approach boosts the clocks behind the scenes and they drop back to the default when it is done, while the inference-only approach doesn’t push the clocks up.
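If you prefer not to pin the clocks system-wide, one alternative (a sketch only, using trtexec’s standard --warmUp and --duration options; the engine file name is the one from this thread) is to give the inference-only run a longer untimed warm-up so the dynamic clocks can ramp up before measurement starts:

```shell
# Longer warm-up before timing begins, so dynamic GPU/DLA clocks can ramp up.
# --warmUp is in milliseconds, --duration is the timed run length in seconds.
trtexec --loadEngine=c3dv2.3.k4.engine \
        --int8 --useDLACore=0 --allowGPUFallback \
        --useSpinWait --separateProfileRun \
        --warmUp=2000 --duration=10
```

This only changes how the benchmark is measured; locking the clocks with nvpmodel/jetson_clocks remains the more deterministic option.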

Hi,

Yes.
Under dynamic clock mode, the GPU clocks rise when a demanding task is executed.

Thanks.

The command “sudo jetson_clocks --show” displays the current clock settings.

How do I restore the default dynamic clock mode? I didn’t save the settings earlier. Thanks.

Hi,

Please run nvpmodel again.
When switching to a power model, the clocks are set back to their dynamic range instead of staying fixed at the maximum.
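Concretely, re-applying the power mode undoes the jetson_clocks pinning (a sketch; mode 0 here matches the earlier command in this thread, substitute the mode you were using):

```shell
# Query the currently active power mode.
sudo nvpmodel -q

# Re-applying a power mode (0 = MAXN, as used earlier in this thread)
# resets the clocks to their dynamic range, undoing jetson_clocks.
sudo nvpmodel -m 0
```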

Thanks.

Thanks.
