xuang | October 27, 2024, 12:32pm | #1
Device: Jetson Orin NX 16GB
Power Mode: MAXN
Model1: regnetx_006_Opset16.onnx
Model2: regnetx_320_Opset18.onnx
Model       | TRT 8.5      | TRT 8.6         | TRT 10.3
regnetx_006 | 723.57 qps   | 468.011 qps     | 463.345 qps
regnetx_320 | 88.4554 qps  | failed to build | 69.8414 qps
TensorRT 8.6/10.3 were benchmarked with the following:
/usr/src/tensorrt/bin/trtexec --onnx=./regnetx_006_Opset16.onnx --fp16 --builderOptimizationLevel=5
Tried with --preview=-disableExternalTacticSourcesForCore0805 (8.6 only), but it didn't help.
TensorRT 8.5:
/usr/src/tensorrt/bin/trtexec --onnx=./regnetx_006_Opset16.onnx --fp16
Also, engine building is much slower on TensorRT 8.6 and 10.3.
Are there any configurations we can try to reduce the performance gap?
xuang | October 27, 2024, 3:30pm | #2
regnetx_006_Opset16_fp16_trt85.txt (9.1 KB)
regnetx_006_Opset16_fp16_trt103.txt (12.1 KB)
Here is some info: it looks like the slowdown is due to the inserted Reformatting CopyNodes. With TRT 8.5 they appear only at the start and end of the network.
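In case it helps reproduction, the attached layer info can be exported with trtexec's layer-info options. A minimal Python sketch (the output filename is my own choice; --profilingVerbosity and --exportLayerInfo are standard trtexec flags):

import subprocess

# Export per-layer info for the FP16 engine so the inserted
# Reformatting CopyNodes are visible. Output filename is illustrative.
subprocess.run([
    "/usr/src/tensorrt/bin/trtexec",
    "--onnx=./regnetx_006_Opset16.onnx",
    "--fp16",
    "--profilingVerbosity=detailed",   # keep layer names in the engine
    "--exportLayerInfo=regnetx_006_layer_info.json",
], check=True)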
Hi,
Thanks a lot for reporting this.
We will give it a try and share more info with you later.
Thanks.
Hi,
Here are our local testing results with Orin NX.

Model       | JP-5.1.4 + TRT-8.5 | JP-6.1 + TRT-10.3
regnetx_006 | 685.871 qps        | 503.104 qps
regnetx_320 | 72.0046 qps        | 79.8105 qps
Confirmed: we also see the perf regression in the regnetx_006 case.
Our internal team is checking this issue. Will let you know once we get feedback.
Thanks.
xuang | October 29, 2024, 11:09am | #7
Thanks. The gap is smaller in your case, but I'm glad you can confirm the issue. Looking forward to your feedback; hopefully it can be addressed soon.
xuang | October 29, 2024, 10:07pm | #8
Also, just for your info, regnetx_008_Opset16 and regnetx_004_Opset16 don't have this regression, which can also be verified through the layer info:
regnetx_008_Opset16_layer_info_trt103.txt (3.2 KB)
So I guess it's related to the layout transform applied when Tensor Cores are used, which isn't handled properly for certain tensor/kernel sizes.
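To compare the models, a quick sketch like the following (assuming the onnx Python package; not a polished tool) can print the group count and per-group input channels of every grouped Conv in the graph:

import onnx

# Scan an ONNX model and report the per-group channel count of each
# grouped Conv. Model path matches the one benchmarked above.
model = onnx.load("regnetx_006_Opset16.onnx")
weights = {t.name: t for t in model.graph.initializer}

for node in model.graph.node:
    if node.op_type != "Conv":
        continue
    group = next((a.i for a in node.attribute if a.name == "group"), 1)
    w = weights.get(node.input[1])
    if group > 1 and w is not None:
        # Conv weights are laid out (M, C/group, kH, kW), so dims[1]
        # is the per-group input channel count.
        print(f"{node.name}: group={group}, per-group channels={w.dims[1]}")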
Hi,
Thanks for the info.
Our internal team is now checking this issue. Will let you know if we get feedback.
Thanks.
Hi,
Thanks for your patience.
We found the perf drop comes from the group conv with per-group channel = 24.
Typically, GPUs are good at group convs with per-group channel counts of 4, 8, 16, or 32.
Is it possible to change the perGroupC to 32 so TensorRT can pick a better kernel for the network?
Thanks.
xuang | November 11, 2024, 9:05am | #12
Hi,
Thanks for coming back. Yes, we noticed this behaviour too; after padding the number of channels to 2^N, the regression is gone. We raised this because there was no issue with TensorRT 8.5. Do you have any plans to fix the issue in future releases?
Hi,
Thanks for your patience.
The perf regression is because we removed the legacy kernels as TensorRT focuses more on newer architectures now.
We recommend using a group convolution with N channels per group, where N is a power of 2.
A possible WAR is to pad N to the next power of 2.
Thanks.
system | December 4, 2024, 3:19am | #15
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.