Can't run nvcr.io/nvidia/l4t-tensorrt:r8.2.1-runtime on Orin AGX

Description

Running the l4t-tensorrt r8.2.1 docker image on Orin AGX, I get the following error:
root@nvidia-desktop:/# /usr/src/tensorrt/bin/trtexec
/usr/src/tensorrt/bin/trtexec: /lib/aarch64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by /usr/lib/aarch64-linux-gnu/tegra/libnvdla_compiler.so)
/usr/src/tensorrt/bin/trtexec: /usr/lib/aarch64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /usr/lib/aarch64-linux-gnu/tegra/libnvdla_compiler.so)
/usr/src/tensorrt/bin/trtexec: /lib/aarch64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by /usr/lib/aarch64-linux-gnu/tegra/libnvtvmr.so)

Environment

Orin AGX with JP5.0

TensorRT Version: 8.2.1
GPU Type: Orin AGX 32GB
Nvidia Driver Version: 510
CUDA Version: 11.6
CUDNN Version: as in nvcr.io/nvidia/l4t-tensorrt:r8.2.1-runtime
Operating System + Version: JP5.0, Ubuntu 20.04
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Hi,

This looks like a Jetson issue. Please refer to the samples below in case they are useful.

For any further assistance, we will move this post to the Jetson forum.

Thanks!

Hi,

The issue is still relevant. The repositories suggested by @NVES are not relevant for my use case.

What I’m trying to do is compare the performance of a private ONNX model on the JP5.0 native TensorRT 8.4 EA against TensorRT 8.2.x from the nvcr.io/nvidia/l4t-tensorrt:r8.2.1-runtime docker image.

Hi,

We also have a TensorRT 8.4 docker image for JetPack 5.0 DP.
Do you want to compare the performance between v8.4 and v8.2, or between running inside and outside the container?

We tested the trtexec binary in nvcr.io/nvidia/l4t-tensorrt:r8.0.1-runtime, and it works correctly.

Thanks.

Hi @AastaLLL ,

My goal is to compare the performance of TensorRT 8.4 and TensorRT 8.2.
I have a private model that performs worse on AGX Orin at 15W than on AGX Xavier at 15W. I expected better performance on Orin and am trying to debug it. One of the differences between the platforms is the TensorRT version.

The docker container is just my way of getting TensorRT 8.2 onto Orin without messing up the system.

tensorrt:r8.0.1 produces a similar error message.

user@nvidia-desktop:~$ sudo docker run -it --rm --net=host --runtime nvidia -e DISPLAY=$DISPLAY -v /tmp/.X11-unix/:/tmp/.X11-unix nvcr.io/nvidia/l4t-tensorrt:r8.0.1-runtime
root@nvidia-desktop:/# /usr/src/tensorrt/bin/trtexec
/usr/src/tensorrt/bin/trtexec: /lib/aarch64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by /usr/lib/aarch64-linux-gnu/tegra/libnvdla_compiler.so)
/usr/src/tensorrt/bin/trtexec: /usr/lib/aarch64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /usr/lib/aarch64-linux-gnu/tegra/libnvdla_compiler.so)
/usr/src/tensorrt/bin/trtexec: /lib/aarch64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by /usr/lib/aarch64-linux-gnu/tegra/libnvtvmr.so)

Hi,

The l4t-tensorrt:r8.0.1-runtime container is built on top of JetPack 4.6.
Since JetPack 4.6 uses Ubuntu 18.04, compatibility issues are possible on a JetPack 5.0 host.
(Previous containers mount libraries from the host, so the host’s libraries can require newer GLIBC/GLIBCXX versions than the container provides.)
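
To make the mismatch concrete, here is a rough Python sketch that pulls the required symbol versions out of the loader errors quoted in this thread. The helper name `missing_versions` and the constant `CONTAINER_GLIBC` are made up for illustration, and the glibc 2.27 figure for Ubuntu 18.04 is my assumption about the container's base OS, not something reported in the logs.

```python
import re

# Matches dynamic-loader messages such as:
#   ...: version `GLIBC_2.29' not found (required by /path/to/lib.so)
# The quote characters are optional because pasted logs often mangle them.
VERSION_ERROR = re.compile(
    r"version\s+[`'‘’]?(?P<symbol>[A-Z]+)_(?P<version>\d+(?:\.\d+)*)[`'‘’]?"
    r"\s+not found\s+\(required by (?P<dependent>[^)]+)\)"
)


def missing_versions(log_text):
    """Return sorted (symbol, version_tuple, dependent_library) entries from log_text."""
    found = set()
    for match in VERSION_ERROR.finditer(log_text):
        version = tuple(int(part) for part in match.group("version").split("."))
        found.add((match.group("symbol"), version, match.group("dependent")))
    return sorted(found)


# Assumption (not from the logs): Ubuntu 18.04 in the container ships glibc 2.27.
CONTAINER_GLIBC = (2, 27)

# Error lines copied from the posts above.
SAMPLE_LOG = """\
/usr/src/tensorrt/bin/trtexec: /lib/aarch64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by /usr/lib/aarch64-linux-gnu/tegra/libnvdla_compiler.so)
/usr/src/tensorrt/bin/trtexec: /usr/lib/aarch64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /usr/lib/aarch64-linux-gnu/tegra/libnvdla_compiler.so)
/usr/src/tensorrt/bin/trtexec: /lib/aarch64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by /usr/lib/aarch64-linux-gnu/tegra/libnvtvmr.so)
"""

for symbol, version, dependent in missing_versions(SAMPLE_LOG):
    dotted = ".".join(str(part) for part in version)
    note = ""
    if symbol == "GLIBC" and version > CONTAINER_GLIBC:
        note = "  (newer than the container's assumed glibc 2.27)"
    print(f"{dependent} needs {symbol}_{dotted}{note}")
```

Both libraries that fail to load live under /usr/lib/aarch64-linux-gnu/tegra, which is consistent with them being mounted from the host rather than shipped in the image.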

The performance issue you mentioned is important to us.
Is it possible to share the model via message so we can give it a check?

Thanks.

Hi,

We gave it a try with the ResNet50.onnx model but were not able to reproduce this issue.

Xavier

$ sudo nvpmodel -q
NV Fan Mode:quiet
NV Power Mode: MODE_15W
2
$ /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/resnet50/ResNet50.onnx
...
[04/28/2022-11:16:50] [I] === Performance summary ===
[04/28/2022-11:16:50] [I] Throughput: 63.1476 qps
[04/28/2022-11:16:50] [I] Latency: min = 15.7756 ms, max = 15.9119 ms, mean = 15.8206 ms, median = 15.8167 ms, percentile(99%) = 15.9011 ms
[04/28/2022-11:16:50] [I] End-to-End Host Latency: min = 15.7852 ms, max = 15.9302 ms, mean = 15.8358 ms, median = 15.8317 ms, percentile(99%) = 15.9097 ms
[04/28/2022-11:16:50] [I] Enqueue Time: min = 1.13879 ms, max = 1.62451 ms, mean = 1.2817 ms, median = 1.27686 ms, percentile(99%) = 1.42432 ms
[04/28/2022-11:16:50] [I] H2D Latency: min = 0.0385742 ms, max = 0.0421753 ms, mean = 0.0398231 ms, median = 0.0395508 ms, percentile(99%) = 0.0420532 ms
[04/28/2022-11:16:50] [I] GPU Compute Time: min = 15.7318 ms, max = 15.8689 ms, mean = 15.7789 ms, median = 15.7756 ms, percentile(99%) = 15.8577 ms
[04/28/2022-11:16:50] [I] D2H Latency: min = 0.00146484 ms, max = 0.00244141 ms, mean = 0.00182954 ms, median = 0.00195312 ms, percentile(99%) = 0.00244141 ms
[04/28/2022-11:16:50] [I] Total Host Walltime: 3.02466 s
[04/28/2022-11:16:50] [I] Total GPU Compute Time: 3.01378 s
[04/28/2022-11:16:50] [I] Explanations of the performance metrics are printed in the verbose logs.
[04/28/2022-11:16:50] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8201] # /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/resnet50/ResNet50.onnx

Orin

$ sudo nvpmodel -q
NV Power Mode: MODE_15W
1
$ /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/resnet50/ResNet50.onnx
...
[04/28/2022-03:15:21] [I] === Performance summary ===
[04/28/2022-03:15:21] [I] Throughput: 89.0295 qps
[04/28/2022-03:15:21] [I] Latency: min = 11.1973 ms, max = 11.301 ms, mean = 11.2604 ms, median = 11.2612 ms, percentile(99%) = 11.2981 ms
[04/28/2022-03:15:21] [I] Enqueue Time: min = 0.457764 ms, max = 0.522461 ms, mean = 0.471977 ms, median = 0.468628 ms, percentile(99%) = 0.50415 ms
[04/28/2022-03:15:21] [I] H2D Latency: min = 0.0643311 ms, max = 0.0765381 ms, mean = 0.0664937 ms, median = 0.065918 ms, percentile(99%) = 0.0721436 ms
[04/28/2022-03:15:21] [I] GPU Compute Time: min = 11.1248 ms, max = 11.2279 ms, mean = 11.1882 ms, median = 11.1887 ms, percentile(99%) = 11.2277 ms
[04/28/2022-03:15:21] [I] D2H Latency: min = 0.00408936 ms, max = 0.00708008 ms, mean = 0.00569752 ms, median = 0.00579834 ms, percentile(99%) = 0.00701904 ms
[04/28/2022-03:15:21] [I] Total Host Walltime: 3.0327 s
[04/28/2022-03:15:21] [I] Total GPU Compute Time: 3.02082 s
[04/28/2022-03:15:21] [I] Explanations of the performance metrics are printed in the verbose logs.
[04/28/2022-03:15:21] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8400] # /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/resnet50/ResNet50.onnx

Thanks.

Hello,

I’m still checking whether I can share the model.
Generally, I’m trying to generate an INT8 engine for my model.

About your ResNet50.onnx results: The ~41% improvement on Orin is more modest than what I would expect, taking into account the clock speed and number of GPU cores on both devices at 15W. Can you explain those results?
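
For reference, the ~41% figure follows directly from the two throughput numbers in the ResNet50 logs above. A minimal Python check (the function name `relative_gain` is just an illustrative helper):

```python
def relative_gain(baseline_qps, new_qps):
    """Fractional throughput gain of new_qps over baseline_qps."""
    return new_qps / baseline_qps - 1.0


xavier_qps = 63.1476  # Xavier, MODE_15W, TensorRT v8201 (from the log above)
orin_qps = 89.0295    # Orin, MODE_15W, TensorRT v8400 (from the log above)

gain = relative_gain(xavier_qps, orin_qps)
print(f"Orin vs. Xavier throughput gain: {gain:.1%}")  # prints 41.0%
```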

Hi @AastaLLL ,

  1. Can I please get your e-mail to share the model privately? I can’t publish it to GitHub.

  2. In the meantime, I’ve encountered a similar issue with YOLOv5m (GitHub - ultralytics/yolov5: YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite, master, sha: 177da7f348181abdf4820ad26707eb8b3dd4fdc9), where Orin produces results similar to Xavier’s (model attached):
    Xavier:

$ sudo nvpmodel -q
NV Fan Mode:cool
NV Power Mode: MODE_15W

$ /usr/src/tensorrt/bin/trtexec --onnx=yolov5m_b1.onnx --saveEngine=yolov5m_fp16_int8_b1.trt --workspace=3000 --warmUp=1000 --int8 --fp16 --iterations=500 --explicitBatch --useCudaGraph

[10/24/2021-18:39:07] [I] === Performance summary ===
[10/24/2021-18:39:07] [I] Throughput: 46.5244 qps
[10/24/2021-18:39:07] [I] Latency: min = 21.3828 ms, max = 25.6064 ms, mean = 21.4827 ms, median = 21.4678 ms, percentile(99%) = 21.5322 ms
[10/24/2021-18:39:07] [I] End-to-End Host Latency: min = 21.3994 ms, max = 25.6255 ms, mean = 21.494 ms, median = 21.4795 ms, percentile(99%) = 21.5428 ms
[10/24/2021-18:39:07] [I] Enqueue Time: min = 0.258789 ms, max = 1.52637 ms, mean = 0.300024 ms, median = 0.286255 ms, percentile(99%) = 0.523438 ms
[10/24/2021-18:39:07] [I] H2D Latency: min = 0.266602 ms, max = 0.307617 ms, mean = 0.268346 ms, median = 0.268311 ms, percentile(99%) = 0.269653 ms
[10/24/2021-18:39:07] [I] GPU Compute Time: min = 20.5654 ms, max = 24.7979 ms, mean = 20.66 ms, median = 20.6448 ms, percentile(99%) = 20.7063 ms
[10/24/2021-18:39:07] [I] D2H Latency: min = 0.458984 ms, max = 0.561157 ms, mean = 0.5543 ms, median = 0.554688 ms, percentile(99%) = 0.55957 ms
[10/24/2021-18:39:07] [I] Total Host Walltime: 10.7471 s
[10/24/2021-18:39:07] [I] Total GPU Compute Time: 10.33 s
[10/24/2021-18:39:07] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/24/2021-18:39:07] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=yolov5m_b1.onnx --saveEngine=yolov5m_fp16_int8_b1.trt --workspace=3000 --warmUp=1000 --int8 --fp16 --iterations=500 --explicitBatch --useCudaGraph

Orin:

$ sudo nvpmodel -q
NV Power Mode: MODE_15W
1
$ /usr/src/tensorrt/bin/trtexec --onnx=yolov5m_b1.onnx --saveEngine=yolov5m_fp16_int8_b1.engine --fp16 --int8 --warmUp=1000 --workspace=3000 --iterations=500
[05/01/2022-17:06:44] [I] === Performance summary ===
[05/01/2022-17:06:44] [I] Throughput: 47.5897 qps
[05/01/2022-17:06:44] [I] Latency: min = 21.9463 ms, max = 31.2423 ms, mean = 22.368 ms, median = 22.1973 ms, percentile(99%) = 28.9878 ms
[05/01/2022-17:06:44] [I] Enqueue Time: min = 1.21045 ms, max = 2.52441 ms, mean = 1.33105 ms, median = 1.26465 ms, percentile(99%) = 1.74902 ms
[05/01/2022-17:06:44] [I] H2D Latency: min = 0.426636 ms, max = 0.625244 ms, mean = 0.43153 ms, median = 0.429688 ms, percentile(99%) = 0.453125 ms
[05/01/2022-17:06:44] [I] GPU Compute Time: min = 20.7031 ms, max = 29.8458 ms, mean = 20.9694 ms, median = 20.7993 ms, percentile(99%) = 27.6223 ms
[05/01/2022-17:06:44] [I] D2H Latency: min = 0.727539 ms, max = 0.972168 ms, mean = 0.967079 ms, median = 0.968018 ms, percentile(99%) = 0.97168 ms
[05/01/2022-17:06:44] [I] Total Host Walltime: 10.5065 s
[05/01/2022-17:06:44] [I] Total GPU Compute Time: 10.4847 s
[05/01/2022-17:06:44] [W] * GPU compute time is unstable, with coefficient of variance = 4.94206%.
[05/01/2022-17:06:44] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[05/01/2022-17:06:44] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/01/2022-17:06:44] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8400] # /usr/src/tensorrt/bin/trtexec --onnx=yolov5m_b1.onnx --saveEngine=yolov5m_fp16_int8_b1.engine --fp16 --int8 --warmUp=1000 --workspace=3000 --iterations=500

yolov5m_b1.onnx (81.2 MB)
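
To compare the two YOLOv5m runs without reading the logs by eye, something like the following Python sketch could be used. It assumes the log format shown above, and `parse_summary` is a hypothetical helper, not part of trtexec.

```python
import re


def parse_summary(log_text):
    """Extract throughput (qps) and mean latency / GPU compute time (ms) from a trtexec summary."""
    summary = {}
    match = re.search(r"\[I\]\s+Throughput:\s*([\d.]+)\s*qps", log_text)
    if match:
        summary["throughput_qps"] = float(match.group(1))
    # Anchoring on "[I] " keeps "Latency" from matching "H2D Latency" or
    # "End-to-End Host Latency", and "GPU Compute Time" from matching the "Total ..." line.
    fields = {"Latency": "latency_mean_ms", "GPU Compute Time": "gpu_compute_mean_ms"}
    for label, key in fields.items():
        match = re.search(r"\[I\]\s+" + re.escape(label) + r":.*?mean = ([\d.]+) ms", log_text)
        if match:
            summary[key] = float(match.group(1))
    return summary


# Excerpts copied from the two logs above.
XAVIER_LOG = """\
[10/24/2021-18:39:07] [I] Throughput: 46.5244 qps
[10/24/2021-18:39:07] [I] Latency: min = 21.3828 ms, max = 25.6064 ms, mean = 21.4827 ms, median = 21.4678 ms, percentile(99%) = 21.5322 ms
[10/24/2021-18:39:07] [I] End-to-End Host Latency: min = 21.3994 ms, max = 25.6255 ms, mean = 21.494 ms, median = 21.4795 ms, percentile(99%) = 21.5428 ms
[10/24/2021-18:39:07] [I] GPU Compute Time: min = 20.5654 ms, max = 24.7979 ms, mean = 20.66 ms, median = 20.6448 ms, percentile(99%) = 20.7063 ms
[10/24/2021-18:39:07] [I] Total GPU Compute Time: 10.33 s
"""

ORIN_LOG = """\
[05/01/2022-17:06:44] [I] Throughput: 47.5897 qps
[05/01/2022-17:06:44] [I] Latency: min = 21.9463 ms, max = 31.2423 ms, mean = 22.368 ms, median = 22.1973 ms, percentile(99%) = 28.9878 ms
[05/01/2022-17:06:44] [I] GPU Compute Time: min = 20.7031 ms, max = 29.8458 ms, mean = 20.9694 ms, median = 20.7993 ms, percentile(99%) = 27.6223 ms
[05/01/2022-17:06:44] [I] Total GPU Compute Time: 10.4847 s
"""

xavier = parse_summary(XAVIER_LOG)
orin = parse_summary(ORIN_LOG)
speedup = orin["throughput_qps"] / xavier["throughput_qps"]
print(f"Orin/Xavier throughput ratio: {speedup:.3f}")
print(f"Mean GPU compute time: Xavier {xavier['gpu_compute_mean_ms']} ms, Orin {orin['gpu_compute_mean_ms']} ms")
```

With these numbers the two devices land within roughly 2% of each other. Keep in mind that the two command lines are not identical (only the Xavier run passes --explicitBatch and --useCudaGraph), so the comparison is approximate.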

Hi,

Please share it via private message directly.

We just want to confirm first:
With yolov5m_b1.onnx, you got similar performance on Orin and Xavier.
With your private model, the performance of Orin is worse than that of Xavier.

Is that correct?

Thanks.

Hi,

Also, please use either --fp16 or --int8 when running inference.

The parser will take the last one as the inference mode.
So based on your logs, Xavier runs in fp16 mode but Orin runs in int8 mode.

If you want TensorRT to run it with mixed precision, please use --best instead.
Thanks.

Thanks @AastaLLL .

  1. I shared my private model yesterday via ‘messages’. Please let me know if you didn’t get it.
    About the private model - yes, the performance of Orin is worse than that of Xavier.

  2. I re-ran yolov5m_b1.onnx on Xavier with --int8 only. The numbers did not change significantly.

[10/26/2021-14:03:10] [I] === Performance summary ===
[10/26/2021-14:03:10] [I] Throughput: 45.4298 qps
[10/26/2021-14:03:10] [I] Latency: min = 21.9102 ms, max = 27.395 ms, mean = 22.0009 ms, median = 21.9845 ms, percentile(99%) = 22.0828 ms
[10/26/2021-14:03:10] [I] End-to-End Host Latency: min = 21.918 ms, max = 27.4088 ms, mean = 22.0119 ms, median = 21.9951 ms, percentile(99%) = 22.0918 ms
[10/26/2021-14:03:10] [I] Enqueue Time: min = 0.265625 ms, max = 0.792236 ms, mean = 0.30576 ms, median = 0.292969 ms, percentile(99%) = 0.513794 ms
[10/26/2021-14:03:10] [I] H2D Latency: min = 0.266602 ms, max = 0.311523 ms, mean = 0.268542 ms, median = 0.268555 ms, percentile(99%) = 0.269531 ms
[10/26/2021-14:03:10] [I] GPU Compute Time: min = 21.0938 ms, max = 26.5862 ms, mean = 21.1805 ms, median = 21.1641 ms, percentile(99%) = 21.2637 ms
[10/26/2021-14:03:10] [I] D2H Latency: min = 0.458984 ms, max = 0.5625 ms, mean = 0.551903 ms, median = 0.552063 ms, percentile(99%) = 0.557617 ms
[10/26/2021-14:03:10] [I] Total Host Walltime: 11.006 s
[10/26/2021-14:03:10] [I] Total GPU Compute Time: 10.5902 s
[10/26/2021-14:03:10] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/26/2021-14:03:10] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=yolov5m_b1.onnx --saveEngine=yolov5m_int8_b1.trt --workspace=3000 --warmUp=1000 --int8 --iterations=500 --explicitBatch --useCudaGraph

Hi,

Thanks for sharing the model.

We confirmed that we can reproduce this performance issue in our environment.
We are checking this issue with our internal team and will get back to you soon.

Thanks.

Hi,

Thanks for your patience.

In 15W mode, the peak GPU frequency of Xavier is 675 MHz while that of Orin is 402 MHz.
As a result, the performance of Orin may not beat Xavier in certain cases.

Below are the detailed settings of the different nvpmodel modes for your reference:
https://docs.nvidia.com/jetson/archives/r34.1/DeveloperGuide/text/SD/PlatformPowerAndPerformance/JetsonOrinNxSeriesAndJetsonAgxOrinSeries.html#supported-modes-and-power-efficiency

Thanks.

Thanks @AastaLLL ,

I’m already aware of the GPU clock difference between Xavier and Orin.
Orin AGX 32GB has 1792 CUDA cores and 56 tensor cores.
Xavier has 512 CUDA cores and 64 tensor cores.
I assumed that Orin, with 3.5 times as many CUDA cores as Xavier, could compensate for the slower clock speed.
Is this assumption wrong?
How can I boost my model’s performance in 15W mode on Orin?

Hi,

Please note that not all CUDA cores are enabled in 15W mode.
(This information is also available in the document shared above.)

Under the 15W setting, Orin enables only 3 TPCs while Xavier has 4 TPCs.
This means that Orin has 6 SMs active but Xavier has 8 SMs.

Thanks.

Thanks @AastaLLL.

So Orin 32GB has only 6 of its 14 SMs online at 15W.
Still, the Orin 32GB GPU’s ‘active cores * clock’ at 15W is only about 6% less than Xavier’s at 15W.
My private model shows significant degradation on Orin, more than 6%.
What accounts for this degradation?

Hi,

At 15W, Xavier is 4TPC@674MHz and Orin is 3TPC@420MHz.

We expect Orin’s per-SM math throughput to be double Xavier’s.
So the Orin peak math throughput is (3 * 2 * 420) / (4 * 674) = 0.93x of Xavier’s. This also matches your calculation.
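
The same estimate written out as a small Python sketch, in case it helps to plug in other power modes. The function name `relative_peak_math` and the parameter `per_sm_factor` are illustrative only, and the 2x per-SM factor is the expectation stated above rather than a measured value.

```python
def relative_peak_math(orin_tpc, orin_mhz, xavier_tpc, xavier_mhz, per_sm_factor=2.0):
    """Estimated Orin peak math throughput relative to Xavier (1.0 means parity).

    per_sm_factor is the assumed per-SM math throughput advantage of Orin.
    Both devices have the same number of SMs per TPC in the figures quoted
    in this thread, so that count cancels out of the ratio.
    """
    orin = orin_tpc * per_sm_factor * orin_mhz
    xavier = xavier_tpc * xavier_mhz
    return orin / xavier


# 15W figures quoted in this post: Orin 3TPC@420MHz, Xavier 4TPC@674MHz.
ratio = relative_peak_math(3, 420, 4, 674)
print(f"Orin/Xavier peak math at 15W: {ratio:.2f}x")

# The earlier post quoted peak frequencies of 402 (Orin) and 675 (Xavier) instead.
earlier = relative_peak_math(3, 402, 4, 675)
print(f"With the earlier clock figures: {earlier:.2f}x")
```

With the clocks from this post the ratio comes out at about 0.93x. With the figures from the earlier post it is closer to 0.89x, so the estimate is fairly sensitive to which clock values are used.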

Please note that Orin is not always slower than Xavier at 15W.
It depends on the complexity of the model.

It seems that the degradation on your model is much larger.
We will double-check this internally and get back to you with feedback.

Thanks.

Hi,

Since the GPU clock is lower in 15W mode, the throughput of Orin may not reach 2x that of Xavier.
This might explain why the degradation is more than 7% on your model.

Thanks.