TensorRT 8.6/10.3 on JetPack 6 is much slower than TensorRT 8.5 on JetPack 5

Device: Jetson Orin NX 16GB
Power Mode: MAXN

Model1: regnetx_006_Opset16.onnx

Model2: regnetx_320_Opset18.onnx

| Model | TRT 8.5 | TRT 8.6 | TRT 10.3 |
|---|---|---|---|
| regnetx_006 | 723.57 qps | 468.011 qps | 463.345 qps |
| regnetx_320 | 88.4554 qps | failed to build | 69.8414 qps |

TensorRT 8.6/10.3 were benchmarked with the following command:

 /usr/src/tensorrt/bin/trtexec --onnx=./regnetx_006_Opset16.onnx --fp16 --builderOptimizationLevel=5

Also tried with --preview=-disableExternalTacticSourcesForCore0805 (8.6 only), but it didn’t help.

TensorRT 8.5:

/usr/src/tensorrt/bin/trtexec --onnx=./regnetx_006_Opset16.onnx --fp16

Also, engine building is much slower on TensorRT 8.6 and 10.3.

Are there any configurations we can try to reduce the performance gap?

regnetx_006_Opset16_fp16_trt85.txt (9.1 KB)
regnetx_006_Opset16_fp16_trt103.txt (12.1 KB)

Here is some more info: it looks like the slowdown is due to the inserted Reformatting CopyNode layers. With TRT 8.5 they only appear at the start and end of the network.
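In case it helps, a minimal sketch of how the reformat layers can be compared across the attached dumps (it simply counts occurrences of the layer-name substring, assuming every inserted reformat layer is named "Reformatting CopyNode ..." in these files):

```python
import sys

# Rough comparison helper: count how many times "Reformatting CopyNode"
# appears in each layer-info / trtexec dump, to see how many reformat
# layers TensorRT inserted per version. Assumes the dumps are plain text
# (or JSON) in which every reformat layer name contains that substring.
for path in sys.argv[1:]:
    with open(path, errors="replace") as f:
        text = f.read()
    count = text.count("Reformatting CopyNode")
    print(f"{path}: {count} reformat layer mentions")
```

For example (script name is just illustrative): `python count_reformats.py regnetx_006_Opset16_fp16_trt85.txt regnetx_006_Opset16_fp16_trt103.txt`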

Hi,

Thanks a lot for reporting this.
We will give it a try and share more info with you later.

Thanks.

Hi,

Here are our local test results on Orin NX.

| Model | JP-5.1.4 + TRT-8.5 | JP-6.1 + TRT-10.3 |
|---|---|---|
| regnetx_006 | 685.871 qps | 503.104 qps |
| regnetx_320 | 72.0046 qps | 79.8105 qps |

We confirmed that we also see the perf regression in the regnetx_006 case.
Our internal team is checking this issue. We will let you know once we get feedback.

Thanks.

Thanks. The gap is smaller in your case, but I’m glad you can confirm the issue. Looking forward to your feedback, and hopefully it can be addressed soon.

Also, just for your info, regnetx_008_Opset16 and regnetx_004_Opset16 don’t have this regression, which can also be verified through the layer info:
regnetx_008_Opset16_layer_info_trt103.txt (3.2 KB)

So I guess it’s related to the layout transforms used when Tensor Cores are in play, and that they aren’t handled properly for certain tensor/kernel sizes.
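To make that concrete, here is a rough sketch (using the onnx Python package) that lists the per-group channel width of every grouped convolution in a model, which is where these regnetx variants differ; for a Conv weight of shape [M, C/group, kH, kW] the per-group width is just dim 1 of the weight initializer:

```python
import sys
import onnx

def group_conv_widths(path):
    """Yield (node name, groups, per-group input channels) for grouped Convs."""
    model = onnx.load(path)
    inits = {init.name: init for init in model.graph.initializer}
    for node in model.graph.node:
        if node.op_type != "Conv":
            continue
        groups = next((a.i for a in node.attribute if a.name == "group"), 1)
        if groups <= 1:
            continue
        weight = inits.get(node.input[1])  # Conv weight: [M, C/groups, kH, kW]
        if weight is not None:
            yield node.name, groups, weight.dims[1]

if __name__ == "__main__":
    for name, groups, per_group_c in group_conv_widths(sys.argv[1]):
        print(f"{name}: groups={groups}, per-group channels={per_group_c}")
```

Running it on e.g. regnetx_006_Opset16.onnx vs. regnetx_008_Opset16.onnx should show which per-group widths line up with the extra reformat layers.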

Hi,

Thanks for the info.

Our internal team is now checking this issue. We will let you know once we get feedback.

Thanks.

Hi,

Thanks for your patience.

We found that the perf drop comes from the group convolutions with a per-group channel count of 24.

Typically, GPUs are good at group convolutions with per-group channel counts of 4, 8, 16, or 32.
Is it possible to change the perGroupC to 32 so TensorRT can pick a better kernel for the network?
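For illustration, a minimal PyTorch sketch of that kind of change (the layer sizes below are hypothetical; the idea is to widen the grouped convs so each group sees a power-of-2 number of channels before re-exporting the ONNX model):

```python
import torch
import torch.nn as nn

groups = 8

# Original-style block: 8 groups x 24 channels = 192 channels per conv,
# i.e. a per-group width of 24, which hits the slow path.
narrow = nn.Conv2d(192, 192, kernel_size=3, padding=1, groups=groups)

# Padded block: 8 groups x 32 channels = 256 channels, so each group sees a
# power-of-2 width and TensorRT can pick a better grouped-conv kernel.
padded = nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=groups)

# Re-export the padded model (here just the single layer) to ONNX.
x = torch.randn(1, 256, 56, 56)
torch.onnx.export(padded, x, "padded_block.onnx", opset_version=16)
```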

Thanks.

Hi,

Thanks for getting back to us. Yes, we noticed this behaviour too: after we padded the number of channels to 2^N, the regression is gone. We raised this because there was no issue with TensorRT 8.5; do you have any plan to fix the issue in future releases?

Hi,

Thanks for your patience.
The perf regression is because we removed the legacy kernels, as TensorRT now focuses more on newer architectures.

We recommend using group convolutions with N channels per group, where N is a power of 2.
A possible workaround (WAR) is to pad N to the next power of 2.
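As a small sketch of that rule, the target width for the padding can be computed like this:

```python
def next_pow2(n: int) -> int:
    """Smallest power of two >= n, e.g. 24 -> 32."""
    return 1 << (n - 1).bit_length()

# Per-group channel widths mentioned in this thread:
for c in (4, 8, 16, 24, 32):
    print(c, "->", next_pow2(c))  # powers of two stay unchanged; 24 -> 32
```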

Thanks.
