Inference slow even with TensorRT

Hi,

I’m using a YOLOX detector. On my computer (NVIDIA RTX 3080) I optimized it with TensorRT 8.6.1 via torch2trt and got around 0.006 s inference time.
The same code on a Jetson Orin AGX, re-optimized with TensorRT 8.5.2, only achieves around 0.035 s inference time, which lowers the FPS a lot.

I am using the same versions of Torch (2.1) and torchvision (0.16) on both machines.
I also installed a CUDA-enabled OpenCV build to speed up image processing, though that should not affect inference time.
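For reference, this is roughly how I time inference (a minimal sketch; `run_inference` is a hypothetical stand-in for the YOLOX forward pass, and with a real CUDA model you would call torch.cuda.synchronize() before each clock read so the timer measures finished GPU work):

```python
import time

def run_inference(frame):
    # Hypothetical stand-in for the actual model forward pass.
    time.sleep(0.006)

def benchmark(n_warmup=10, n_runs=100):
    frame = object()  # placeholder input
    for _ in range(n_warmup):  # warm-up so lazy initialization is excluded
        run_inference(frame)
    # With a real CUDA model: torch.cuda.synchronize() here and after the loop.
    start = time.perf_counter()
    for _ in range(n_runs):
        run_inference(frame)
    return (time.perf_counter() - start) / n_runs

print(f"mean inference time: {benchmark():.4f} s")
```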

I ran these two commands before the TensorRT optimizations and before benchmarking:

sudo nvpmodel -m 0
sudo jetson_clocks
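To double-check that those settings took effect, the active power mode and clocks can be queried (a sketch of the standard Jetson tools):

```shell
# Query the active power model (should report mode 0 / MAXN after `nvpmodel -m 0`)
sudo nvpmodel -q
# Show current clock settings to confirm jetson_clocks maxed them out
sudo jetson_clocks --show
```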

Could my much higher Orin inference time be caused by the Orin’s TensorRT version being older than the one on my computer?

JP 5.1.1
L4T 35.3.1

Regards

I can add that another difference is that I have CUDA 11.4 on the Orin AGX and CUDA 12.1 on the computer.

Hi,

Could you monitor the device with tegrastats to see if the GPU is fully occupied?

$ sudo tegrastats

Thanks.


Hi,

This is what I get from jtop and tegrastats.

My GPU usage oscillates between 65-99%.

All my libraries are CUDA-enabled:
OpenCV
ONNX runtime
Torch
Torchvision

And I use torch2trt to generate the TensorRT-accelerated model. It worked well on my computer but somehow didn’t accelerate much on the Jetson Orin AGX.

Hi,

When the GPU is fully utilized, its usage should stay pinned at 99%.
Since yours oscillates, the GPU might be waiting for data in your pipeline.

Since you have a TensorRT engine, could you run it with trtexec to verify this?

$ /usr/src/tensorrt/bin/trtexec --loadEngine=[file]

A common cause is OpenCV, whose host-device data transfers run relatively slower on Jetson than in a desktop environment.

Thanks.

Ok thanks, I will try this.

I also got the warning [ Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output ] while the TRT model was being generated. I didn’t get that when creating the TRT model on my computer.
Can this be another reason for the sub-optimal GPU usage and low FPS?

This is what I got running

/usr/src/tensorrt/bin/trtexec --loadEngine=[file]

&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=model_trt.engine
[10/25/2023-12:25:27] [I] === Model Options ===
[10/25/2023-12:25:27] [I] Format: *
[10/25/2023-12:25:27] [I] Model:
[10/25/2023-12:25:27] [I] Output:
[10/25/2023-12:25:27] [I] === Build Options ===
[10/25/2023-12:25:27] [I] Max batch: 1
[10/25/2023-12:25:27] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/25/2023-12:25:27] [I] minTiming: 1
[10/25/2023-12:25:27] [I] avgTiming: 8
[10/25/2023-12:25:27] [I] Precision: FP32
[10/25/2023-12:25:27] [I] LayerPrecisions:
[10/25/2023-12:25:27] [I] Calibration:
[10/25/2023-12:25:27] [I] Refit: Disabled
[10/25/2023-12:25:27] [I] Sparsity: Disabled
[10/25/2023-12:25:27] [I] Safe mode: Disabled
[10/25/2023-12:25:27] [I] DirectIO mode: Disabled
[10/25/2023-12:25:27] [I] Restricted mode: Disabled
[10/25/2023-12:25:27] [I] Build only: Disabled
[10/25/2023-12:25:27] [I] Save engine:
[10/25/2023-12:25:27] [I] Load engine: model_trt.engine
[10/25/2023-12:25:27] [I] Profiling verbosity: 0
[10/25/2023-12:25:27] [I] Tactic sources: Using default tactic sources
[10/25/2023-12:25:27] [I] timingCacheMode: local
[10/25/2023-12:25:27] [I] timingCacheFile:
[10/25/2023-12:25:27] [I] Heuristic: Disabled
[10/25/2023-12:25:27] [I] Preview Features: Use default preview flags.
[10/25/2023-12:25:27] [I] Input(s)s format: fp32:CHW
[10/25/2023-12:25:27] [I] Output(s)s format: fp32:CHW
[10/25/2023-12:25:27] [I] Input build shapes: model
[10/25/2023-12:25:27] [I] Input calibration shapes: model
[10/25/2023-12:25:27] [I] === System Options ===
[10/25/2023-12:25:27] [I] Device: 0
[10/25/2023-12:25:27] [I] DLACore:
[10/25/2023-12:25:27] [I] Plugins:
[10/25/2023-12:25:27] [I] === Inference Options ===
[10/25/2023-12:25:27] [I] Batch: 1
[10/25/2023-12:25:27] [I] Input inference shapes: model
[10/25/2023-12:25:27] [I] Iterations: 10
[10/25/2023-12:25:27] [I] Duration: 3s (+ 200ms warm up)
[10/25/2023-12:25:27] [I] Sleep time: 0ms
[10/25/2023-12:25:27] [I] Idle time: 0ms
[10/25/2023-12:25:27] [I] Streams: 1
[10/25/2023-12:25:27] [I] ExposeDMA: Disabled
[10/25/2023-12:25:27] [I] Data transfers: Enabled
[10/25/2023-12:25:27] [I] Spin-wait: Disabled
[10/25/2023-12:25:27] [I] Multithreading: Disabled
[10/25/2023-12:25:27] [I] CUDA Graph: Disabled
[10/25/2023-12:25:27] [I] Separate profiling: Disabled
[10/25/2023-12:25:27] [I] Time Deserialize: Disabled
[10/25/2023-12:25:27] [I] Time Refit: Disabled
[10/25/2023-12:25:27] [I] NVTX verbosity: 0
[10/25/2023-12:25:27] [I] Persistent Cache Ratio: 0
[10/25/2023-12:25:27] [I] Inputs:
[10/25/2023-12:25:27] [I] === Reporting Options ===
[10/25/2023-12:25:27] [I] Verbose: Disabled
[10/25/2023-12:25:27] [I] Averages: 10 inferences
[10/25/2023-12:25:27] [I] Percentiles: 90,95,99
[10/25/2023-12:25:27] [I] Dump refittable layers:Disabled
[10/25/2023-12:25:27] [I] Dump output: Disabled
[10/25/2023-12:25:27] [I] Profile: Disabled
[10/25/2023-12:25:27] [I] Export timing to JSON file:
[10/25/2023-12:25:27] [I] Export output to JSON file:
[10/25/2023-12:25:27] [I] Export profile to JSON file:
[10/25/2023-12:25:27] [I]
[10/25/2023-12:25:27] [I] === Device Information ===
[10/25/2023-12:25:27] [I] Selected Device: Orin
[10/25/2023-12:25:27] [I] Compute Capability: 8.7
[10/25/2023-12:25:27] [I] SMs: 14
[10/25/2023-12:25:27] [I] Compute Clock Rate: 0.93 GHz
[10/25/2023-12:25:27] [I] Device Global Memory: 30587 MiB
[10/25/2023-12:25:27] [I] Shared Memory per SM: 164 KiB
[10/25/2023-12:25:27] [I] Memory Bus Width: 128 bits (ECC disabled)
[10/25/2023-12:25:27] [I] Memory Clock Rate: 0.93 GHz
[10/25/2023-12:25:27] [I]
[10/25/2023-12:25:27] [I] TensorRT version: 8.5.2
[10/25/2023-12:25:27] [I] Engine loaded in 0.111328 sec.
[10/25/2023-12:25:28] [I] [TRT] Loaded engine size: 191 MiB
[10/25/2023-12:25:28] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +188, now: CPU 0, GPU 188 (MiB)
[10/25/2023-12:25:28] [I] Engine deserialized in 0.459144 sec.
[10/25/2023-12:25:28] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +21, now: CPU 0, GPU 209 (MiB)
[10/25/2023-12:25:28] [I] Setting persistentCacheLimit to 0 bytes.
[10/25/2023-12:25:28] [I] Using random values for input input_0
[10/25/2023-12:25:28] [I] Created input binding for input_0 with dimensions 1x3x384x640
[10/25/2023-12:25:28] [I] Using random values for output output_0
[10/25/2023-12:25:28] [I] Created output binding for output_0 with dimensions 1x5040x6
[10/25/2023-12:25:28] [I] Starting inference
[10/25/2023-12:25:31] [I] Warmup completed 13 queries over 200 ms
[10/25/2023-12:25:31] [I] Timing trace has 202 queries over 3.05078 s
[10/25/2023-12:25:31] [I]
[10/25/2023-12:25:31] [I] === Trace details ===
[10/25/2023-12:25:31] [I] Trace averages of 10 runs:
[10/25/2023-12:25:31] [I] Average on 10 runs - GPU latency: 14.9151 ms - Host latency: 15.0672 ms (enqueue 2.24111 ms)
[10/25/2023-12:25:31] [I] Average on 10 runs - GPU latency: 14.9089 ms - Host latency: 15.0574 ms (enqueue 2.41513 ms)
[10/25/2023-12:25:31] [I] Average on 10 runs - GPU latency: 15.2498 ms - Host latency: 15.399 ms (enqueue 2.18823 ms)
[10/25/2023-12:25:31] [I] Average on 10 runs - GPU latency: 14.9032 ms - Host latency: 15.0534 ms (enqueue 2.38443 ms)
[10/25/2023-12:25:31] [I] Average on 10 runs - GPU latency: 14.9036 ms - Host latency: 15.056 ms (enqueue 2.31855 ms)
[10/25/2023-12:25:31] [I] Average on 10 runs - GPU latency: 15.1243 ms - Host latency: 15.2745 ms (enqueue 2.23926 ms)
[10/25/2023-12:25:31] [I] Average on 10 runs - GPU latency: 15.2549 ms - Host latency: 15.4055 ms (enqueue 2.41406 ms)
[10/25/2023-12:25:31] [I] Average on 10 runs - GPU latency: 14.9061 ms - Host latency: 15.0557 ms (enqueue 2.29213 ms)
[10/25/2023-12:25:31] [I] Average on 10 runs - GPU latency: 14.9072 ms - Host latency: 15.0634 ms (enqueue 2.28705 ms)
[10/25/2023-12:25:31] [I] Average on 10 runs - GPU latency: 14.9827 ms - Host latency: 15.1308 ms (enqueue 2.34404 ms)
[10/25/2023-12:25:31] [I] Average on 10 runs - GPU latency: 15.2505 ms - Host latency: 15.399 ms (enqueue 2.57365 ms)
[10/25/2023-12:25:31] [I] Average on 10 runs - GPU latency: 14.9034 ms - Host latency: 15.0542 ms (enqueue 2.26763 ms)
[10/25/2023-12:25:31] [I] Average on 10 runs - GPU latency: 15.1184 ms - Host latency: 15.2677 ms (enqueue 2.41906 ms)
[10/25/2023-12:25:31] [I] Average on 10 runs - GPU latency: 14.9095 ms - Host latency: 15.0582 ms (enqueue 2.33872 ms)
[10/25/2023-12:25:31] [I] Average on 10 runs - GPU latency: 15.2526 ms - Host latency: 15.4025 ms (enqueue 2.33296 ms)
[10/25/2023-12:25:31] [I] Average on 10 runs - GPU latency: 14.9 ms - Host latency: 15.0471 ms (enqueue 2.39587 ms)
[10/25/2023-12:25:31] [I] Average on 10 runs - GPU latency: 14.8994 ms - Host latency: 15.0484 ms (enqueue 2.2696 ms)
[10/25/2023-12:25:31] [I] Average on 10 runs - GPU latency: 14.9083 ms - Host latency: 15.0576 ms (enqueue 2.34155 ms)
[10/25/2023-12:25:31] [I] Average on 10 runs - GPU latency: 15.2489 ms - Host latency: 15.3978 ms (enqueue 2.37954 ms)
[10/25/2023-12:25:31] [I] Average on 10 runs - GPU latency: 15.134 ms - Host latency: 15.283 ms (enqueue 2.55666 ms)
[10/25/2023-12:25:31] [I]
[10/25/2023-12:25:31] [I] === Performance summary ===
[10/25/2023-12:25:31] [I] Throughput: 66.2127 qps
[10/25/2023-12:25:31] [I] Latency: min = 15.0173 ms, max = 17.2866 ms, mean = 15.1779 ms, median = 15.0572 ms, percentile(90%) = 15.0862 ms, percentile(95%) = 16.3351 ms, percentile(99%) = 17.248 ms
[10/25/2023-12:25:31] [I] Enqueue Time: min = 1.24451 ms, max = 3.85168 ms, mean = 2.3456 ms, median = 2.33392 ms, percentile(90%) = 2.55066 ms, percentile(95%) = 2.55591 ms, percentile(99%) = 3.82007 ms
[10/25/2023-12:25:31] [I] H2D Latency: min = 0.130249 ms, max = 0.150421 ms, mean = 0.135917 ms, median = 0.13501 ms, percentile(90%) = 0.139526 ms, percentile(95%) = 0.14447 ms, percentile(99%) = 0.147217 ms
[10/25/2023-12:25:31] [I] GPU Compute Time: min = 14.8698 ms, max = 17.1372 ms, mean = 15.0281 ms, median = 14.908 ms, percentile(90%) = 14.9363 ms, percentile(95%) = 16.1836 ms, percentile(99%) = 17.0989 ms
[10/25/2023-12:25:31] [I] D2H Latency: min = 0.00610352 ms, max = 0.0175781 ms, mean = 0.0139218 ms, median = 0.013916 ms, percentile(90%) = 0.0153809 ms, percentile(95%) = 0.0158691 ms, percentile(99%) = 0.0166626 ms
[10/25/2023-12:25:31] [I] Total Host Walltime: 3.05078 s
[10/25/2023-12:25:31] [I] Total GPU Compute Time: 3.03567 s
[10/25/2023-12:25:31] [W] * GPU compute time is unstable, with coefficient of variance = 3.08206%.
[10/25/2023-12:25:31] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[10/25/2023-12:25:31] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/25/2023-12:25:31] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=model_trt.engine

Another possible reason for the GPU not performing as well as it should: I do not have a working fan on the Orin AGX I’m using, as you can see here:

Hi,

The workspace message is a warning rather than an error.
You can set the workspace to a larger value based on your use case.
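For example, the engine could be rebuilt with a larger workspace pool (a sketch; `model.onnx` is a placeholder for your exported model, and torch2trt exposes a similar `max_workspace_size` argument):

```shell
# Rebuild with a 2 GiB workspace pool (TensorRT 8.4+ syntax; size in MiB)
/usr/src/tensorrt/bin/trtexec --onnx=model.onnx \
    --memPoolSize=workspace:2048 \
    --saveEngine=model_trt.engine
```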

The inference time is ~15 ms on Orin.
Could you run the same trtexec benchmark on your 3080?

If the TensorRT performance difference is acceptable, then the lower FPS is likely caused by other libraries such as OpenCV.
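As a sanity check, throughput follows directly from mean latency, so the ~15 ms compute time is consistent with the ~66 qps trtexec reports:

```python
def fps_from_latency_ms(latency_ms):
    """Queries per second for a serial pipeline with the given per-frame latency."""
    return 1000.0 / latency_ms

print(fps_from_latency_ms(15.03))  # Orin mean GPU compute time -> roughly 66 qps
```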

Thanks.


Sure, here are my results on the 3080. I expected the Orin’s inference to be much better than what I’m getting, but it isn’t. What do you think could be the reason?

&&&& RUNNING TensorRT.trtexec [TensorRT v8601] # /home/anis/cv_base/installs/TensorRT-8.6.1.6/targets/x86_64-linux-gnu/bin/trtexec --loadEngine=real_model_trt.engine
[10/26/2023-10:31:47] [I] === Model Options ===
[10/26/2023-10:31:47] [I] Format: *
[10/26/2023-10:31:47] [I] Model:
[10/26/2023-10:31:47] [I] Output:
[10/26/2023-10:31:47] [I] === Build Options ===
[10/26/2023-10:31:47] [I] Max batch: 1
[10/26/2023-10:31:47] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/26/2023-10:31:47] [I] minTiming: 1
[10/26/2023-10:31:47] [I] avgTiming: 8
[10/26/2023-10:31:47] [I] Precision: FP32
[10/26/2023-10:31:47] [I] LayerPrecisions:
[10/26/2023-10:31:47] [I] Layer Device Types:
[10/26/2023-10:31:47] [I] Calibration:
[10/26/2023-10:31:47] [I] Refit: Disabled
[10/26/2023-10:31:47] [I] Version Compatible: Disabled
[10/26/2023-10:31:47] [I] TensorRT runtime: full
[10/26/2023-10:31:47] [I] Lean DLL Path:
[10/26/2023-10:31:47] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[10/26/2023-10:31:47] [I] Exclude Lean Runtime: Disabled
[10/26/2023-10:31:47] [I] Sparsity: Disabled
[10/26/2023-10:31:47] [I] Safe mode: Disabled
[10/26/2023-10:31:47] [I] Build DLA standalone loadable: Disabled
[10/26/2023-10:31:47] [I] Allow GPU fallback for DLA: Disabled
[10/26/2023-10:31:47] [I] DirectIO mode: Disabled
[10/26/2023-10:31:47] [I] Restricted mode: Disabled
[10/26/2023-10:31:47] [I] Skip inference: Disabled
[10/26/2023-10:31:47] [I] Save engine:
[10/26/2023-10:31:47] [I] Load engine: real_model_trt.engine
[10/26/2023-10:31:47] [I] Profiling verbosity: 0
[10/26/2023-10:31:47] [I] Tactic sources: Using default tactic sources
[10/26/2023-10:31:47] [I] timingCacheMode: local
[10/26/2023-10:31:47] [I] timingCacheFile:
[10/26/2023-10:31:47] [I] Heuristic: Disabled
[10/26/2023-10:31:47] [I] Preview Features: Use default preview flags.
[10/26/2023-10:31:47] [I] MaxAuxStreams: -1
[10/26/2023-10:31:47] [I] BuilderOptimizationLevel: -1
[10/26/2023-10:31:47] [I] Input(s)s format: fp32:CHW
[10/26/2023-10:31:47] [I] Output(s)s format: fp32:CHW
[10/26/2023-10:31:47] [I] Input build shapes: model
[10/26/2023-10:31:47] [I] Input calibration shapes: model
[10/26/2023-10:31:47] [I] === System Options ===
[10/26/2023-10:31:47] [I] Device: 0
[10/26/2023-10:31:47] [I] DLACore:
[10/26/2023-10:31:47] [I] Plugins:
[10/26/2023-10:31:47] [I] setPluginsToSerialize:
[10/26/2023-10:31:47] [I] dynamicPlugins:
[10/26/2023-10:31:47] [I] ignoreParsedPluginLibs: 0
[10/26/2023-10:31:47] [I]
[10/26/2023-10:31:47] [I] === Inference Options ===
[10/26/2023-10:31:47] [I] Batch: 1
[10/26/2023-10:31:47] [I] Input inference shapes: model
[10/26/2023-10:31:47] [I] Iterations: 10
[10/26/2023-10:31:47] [I] Duration: 3s (+ 200ms warm up)
[10/26/2023-10:31:47] [I] Sleep time: 0ms
[10/26/2023-10:31:47] [I] Idle time: 0ms
[10/26/2023-10:31:47] [I] Inference Streams: 1
[10/26/2023-10:31:47] [I] ExposeDMA: Disabled
[10/26/2023-10:31:47] [I] Data transfers: Enabled
[10/26/2023-10:31:47] [I] Spin-wait: Disabled
[10/26/2023-10:31:47] [I] Multithreading: Disabled
[10/26/2023-10:31:47] [I] CUDA Graph: Disabled
[10/26/2023-10:31:47] [I] Separate profiling: Disabled
[10/26/2023-10:31:47] [I] Time Deserialize: Disabled
[10/26/2023-10:31:47] [I] Time Refit: Disabled
[10/26/2023-10:31:47] [I] NVTX verbosity: 0
[10/26/2023-10:31:47] [I] Persistent Cache Ratio: 0
[10/26/2023-10:31:47] [I] Inputs:
[10/26/2023-10:31:47] [I] === Reporting Options ===
[10/26/2023-10:31:47] [I] Verbose: Disabled
[10/26/2023-10:31:47] [I] Averages: 10 inferences
[10/26/2023-10:31:47] [I] Percentiles: 90,95,99
[10/26/2023-10:31:47] [I] Dump refittable layers:Disabled
[10/26/2023-10:31:47] [I] Dump output: Disabled
[10/26/2023-10:31:47] [I] Profile: Disabled
[10/26/2023-10:31:47] [I] Export timing to JSON file:
[10/26/2023-10:31:47] [I] Export output to JSON file:
[10/26/2023-10:31:47] [I] Export profile to JSON file:
[10/26/2023-10:31:47] [I]
[10/26/2023-10:31:48] [I] === Device Information ===
[10/26/2023-10:31:48] [I] Selected Device: NVIDIA GeForce RTX 3080 Ti Laptop GPU
[10/26/2023-10:31:48] [I] Compute Capability: 8.6
[10/26/2023-10:31:48] [I] SMs: 58
[10/26/2023-10:31:48] [I] Device Global Memory: 16116 MiB
[10/26/2023-10:31:48] [I] Shared Memory per SM: 100 KiB
[10/26/2023-10:31:48] [I] Memory Bus Width: 256 bits (ECC disabled)
[10/26/2023-10:31:48] [I] Application Compute Clock Rate: 1.545 GHz
[10/26/2023-10:31:48] [I] Application Memory Clock Rate: 8.001 GHz
[10/26/2023-10:31:48] [I]
[10/26/2023-10:31:48] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[10/26/2023-10:31:48] [I]
[10/26/2023-10:31:48] [I] TensorRT version: 8.6.1
[10/26/2023-10:31:48] [I] Loading standard plugins
[10/26/2023-10:31:49] [I] Engine loaded in 0.164344 sec.
[10/26/2023-10:31:49] [I] [TRT] Loaded engine size: 192 MiB
[10/26/2023-10:31:49] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +188, now: CPU 0, GPU 188 (MiB)
[10/26/2023-10:31:49] [I] Engine deserialized in 0.412792 sec.
[10/26/2023-10:31:49] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +1, GPU +20, now: CPU 1, GPU 208 (MiB)
[10/26/2023-10:31:49] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See “Lazy Loading” section of CUDA documentation CUDA C Programming Guide
[10/26/2023-10:31:49] [I] Setting persistentCacheLimit to 0 bytes.
[10/26/2023-10:31:49] [I] Using random values for input input_0
[10/26/2023-10:31:49] [I] Input binding for input_0 with dimensions 1x3x384x640 is created.
[10/26/2023-10:31:49] [I] Output binding for output_0 with dimensions 1x5040x6 is created.
[10/26/2023-10:31:49] [I] Starting inference
[10/26/2023-10:31:52] [I] Warmup completed 43 queries over 200 ms
[10/26/2023-10:31:52] [I] Timing trace has 631 queries over 3.01371 s
[10/26/2023-10:31:52] [I]
[10/26/2023-10:31:52] [I] === Trace details ===
[10/26/2023-10:31:52] [I] Trace averages of 10 runs:
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.72258 ms - Host latency: 4.99298 ms (enqueue 1.50741 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.7232 ms - Host latency: 4.99966 ms (enqueue 1.43652 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.72433 ms - Host latency: 4.99382 ms (enqueue 1.54504 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.73324 ms - Host latency: 5.00919 ms (enqueue 1.4942 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.73836 ms - Host latency: 5.01987 ms (enqueue 1.53353 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.73672 ms - Host latency: 5.00632 ms (enqueue 1.4737 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.73579 ms - Host latency: 5.00628 ms (enqueue 1.51228 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.73734 ms - Host latency: 5.00536 ms (enqueue 1.44769 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.73763 ms - Host latency: 5.00671 ms (enqueue 1.52356 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.73887 ms - Host latency: 5.02597 ms (enqueue 1.53574 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.73817 ms - Host latency: 5.00743 ms (enqueue 1.50223 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.73702 ms - Host latency: 5.00533 ms (enqueue 1.52804 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.73755 ms - Host latency: 5.00848 ms (enqueue 1.4395 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.73704 ms - Host latency: 5.00766 ms (enqueue 1.45115 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.8684 ms - Host latency: 5.13648 ms (enqueue 1.56238 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.85784 ms - Host latency: 5.12993 ms (enqueue 1.49645 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.83175 ms - Host latency: 5.10137 ms (enqueue 1.56773 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.8043 ms - Host latency: 5.07369 ms (enqueue 1.52651 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.78649 ms - Host latency: 5.06724 ms (enqueue 1.49972 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.76908 ms - Host latency: 5.03754 ms (enqueue 1.56487 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.76509 ms - Host latency: 5.04608 ms (enqueue 1.55416 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.75299 ms - Host latency: 5.02311 ms (enqueue 1.39222 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.74756 ms - Host latency: 5.02847 ms (enqueue 1.51088 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.7441 ms - Host latency: 5.02529 ms (enqueue 1.43589 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.75104 ms - Host latency: 5.0443 ms (enqueue 1.48708 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.74368 ms - Host latency: 5.01317 ms (enqueue 1.52688 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.74326 ms - Host latency: 5.01222 ms (enqueue 1.46608 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.74431 ms - Host latency: 5.01556 ms (enqueue 1.54716 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.74412 ms - Host latency: 5.01187 ms (enqueue 1.49382 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.74561 ms - Host latency: 5.01483 ms (enqueue 1.55328 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.7657 ms - Host latency: 5.03463 ms (enqueue 1.48347 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.7522 ms - Host latency: 5.02009 ms (enqueue 1.44824 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.74883 ms - Host latency: 5.02306 ms (enqueue 1.54489 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.78022 ms - Host latency: 5.0602 ms (enqueue 1.39436 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.81821 ms - Host latency: 5.08734 ms (enqueue 1.37657 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.82599 ms - Host latency: 5.10791 ms (enqueue 1.56145 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.82363 ms - Host latency: 5.09242 ms (enqueue 1.55054 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.80361 ms - Host latency: 5.08021 ms (enqueue 1.5431 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.7874 ms - Host latency: 5.05627 ms (enqueue 1.47452 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.78599 ms - Host latency: 5.05696 ms (enqueue 1.54534 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.7699 ms - Host latency: 5.03936 ms (enqueue 1.54199 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.7668 ms - Host latency: 5.03613 ms (enqueue 1.47878 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.76887 ms - Host latency: 5.04919 ms (enqueue 1.41929 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.76907 ms - Host latency: 5.04937 ms (enqueue 1.52446 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.7698 ms - Host latency: 5.06228 ms (enqueue 1.55574 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.76877 ms - Host latency: 5.0469 ms (enqueue 1.42532 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.76836 ms - Host latency: 5.04028 ms (enqueue 1.37715 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.76436 ms - Host latency: 5.03508 ms (enqueue 1.4637 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.76123 ms - Host latency: 5.02927 ms (enqueue 1.60452 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.76121 ms - Host latency: 5.03076 ms (enqueue 1.51987 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.77354 ms - Host latency: 5.04363 ms (enqueue 1.52141 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.76453 ms - Host latency: 5.03586 ms (enqueue 1.47712 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.76694 ms - Host latency: 5.03901 ms (enqueue 1.52253 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.76855 ms - Host latency: 5.05759 ms (enqueue 1.38064 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.77751 ms - Host latency: 5.03938 ms (enqueue 1.43896 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.77976 ms - Host latency: 5.04946 ms (enqueue 1.42031 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.78242 ms - Host latency: 5.06536 ms (enqueue 1.52871 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.78118 ms - Host latency: 5.05344 ms (enqueue 1.59441 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.78259 ms - Host latency: 5.05225 ms (enqueue 1.61548 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.78081 ms - Host latency: 5.05215 ms (enqueue 1.43298 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.78037 ms - Host latency: 5.04861 ms (enqueue 1.44202 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.7842 ms - Host latency: 5.06677 ms (enqueue 1.55566 ms)
[10/26/2023-10:31:52] [I] Average on 10 runs - GPU latency: 4.76714 ms - Host latency: 5.03625 ms (enqueue 1.51907 ms)
[10/26/2023-10:31:52] [I]
[10/26/2023-10:31:52] [I] === Performance summary ===
[10/26/2023-10:31:52] [I] Throughput: 209.376 qps
[10/26/2023-10:31:52] [I] Latency: min = 4.98651 ms, max = 5.21228 ms, mean = 5.04054 ms, median = 5.03833 ms, percentile(90%) = 5.09277 ms, percentile(95%) = 5.12927 ms, percentile(99%) = 5.16223 ms
[10/26/2023-10:31:52] [I] Enqueue Time: min = 0.954346 ms, max = 1.86316 ms, mean = 1.49854 ms, median = 1.48975 ms, percentile(90%) = 1.67883 ms, percentile(95%) = 1.71997 ms, percentile(99%) = 1.78662 ms
[10/26/2023-10:31:52] [I] H2D Latency: min = 0.240845 ms, max = 0.38501 ms, mean = 0.258606 ms, median = 0.254639 ms, percentile(90%) = 0.258179 ms, percentile(95%) = 0.269653 ms, percentile(99%) = 0.374023 ms
[10/26/2023-10:31:52] [I] GPU Compute Time: min = 4.71759 ms, max = 4.87933 ms, mean = 4.76711 ms, median = 4.76074 ms, percentile(90%) = 4.81995 ms, percentile(95%) = 4.82825 ms, percentile(99%) = 4.87219 ms
[10/26/2023-10:31:52] [I] D2H Latency: min = 0.012207 ms, max = 0.0400391 ms, mean = 0.0148242 ms, median = 0.0145264 ms, percentile(90%) = 0.0153809 ms, percentile(95%) = 0.015625 ms, percentile(99%) = 0.0290527 ms
[10/26/2023-10:31:52] [I] Total Host Walltime: 3.01371 s
[10/26/2023-10:31:52] [I] Total GPU Compute Time: 3.00805 s
[10/26/2023-10:31:52] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/26/2023-10:31:52] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # /home/anis/cv_base/installs/TensorRT-8.6.1.6/targets/x86_64-linux-gnu/bin/trtexec --loadEngine=real_model_trt.engine

Even without TensorRT, my RTX 3080 Ti is 3x faster at inference.
I benchmarked my Orin AGX and it gives results comparable to those at https://developer.nvidia.com/embedded/jetson-benchmarks.
So there are no hardware or max-performance configuration changes left to make.

Hi,

Just checking the specs for Orin and the 3080:

  • Orin: 5.3 TFLOPS (FP32)
  • 3080: 29.78 TFLOPS (FP32)

So the difference you see is expected.
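A rough back-of-the-envelope check (the latencies are the mean GPU compute times from the trtexec runs above; a peak-FLOPS ratio only bounds the gap, it doesn’t predict it exactly):

```python
orin_tflops, rtx3080_tflops = 5.3, 29.78  # FP32 peak throughput
peak_ratio = rtx3080_tflops / orin_tflops
print(f"peak compute ratio: {peak_ratio:.1f}x")  # ~5.6x

measured_ratio = 14.9 / 4.77  # mean GPU compute times in ms (Orin vs 3080 Ti)
print(f"measured latency ratio: {measured_ratio:.1f}x")  # ~3.1x
```

The Orin actually doing better than the raw ratio suggests is plausible, since neither engine reaches peak FLOPS.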

On embedded systems, we expect use cases to run at lower precision.
You can try INT8 or FP16 to see if performance improves.

Thanks.


Using model.half() for FP16 gives similar inference times.

Hi,

Please try trtexec with the original model and the --fp16 flag.
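Something like this (a sketch; `model.onnx` is a placeholder for your exported YOLOX model):

```shell
# Build an FP16 engine from the original ONNX model
/usr/src/tensorrt/bin/trtexec --onnx=model.onnx --fp16 \
    --saveEngine=model_fp16.engine
# Then benchmark the FP16 engine
/usr/src/tensorrt/bin/trtexec --loadEngine=model_fp16.engine
```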

Thanks.