TensorRT tactics skip

Description

Hi, I made a program that builds a network using the TensorRT API and runs inference.
The issue is that there is a huge performance gap (2x) between the natively compiled executable and the cross-compiled executable on a Jetson Xavier NX.
Debugging a bit, I noticed that while optimizing the ActivationLeaky layers of my network, the cross-compiled executable skips the PointWiseV2 tactics, resulting in much worse performance.

Here is a piece of the TensorRT debug log from the natively compiled build:

TENSORRT LOG: --------------- Timing Runner: ActivationLeaky178 (PointWiseV2)
TENSORRT LOG: Tactic: 0 Time: 0.038912
TENSORRT LOG: Tactic: 1 Time: 0.035232
TENSORRT LOG: Tactic: 2 Time: 0.02864
TENSORRT LOG: Tactic: 3 Time: 0.03424
TENSORRT LOG: Tactic: 4 Time: 0.027648
TENSORRT LOG: Tactic: 5 Time: 0.026912
TENSORRT LOG: Tactic: 6 Time: 0.03696
TENSORRT LOG: Tactic: 7 Time: 0.029408
TENSORRT LOG: Tactic: 8 Time: 0.027488
TENSORRT LOG: Tactic: 9 Time: 0.027392
TENSORRT LOG: Tactic: 28 Time: 0.038272
TENSORRT LOG: Fastest Tactic: 5 Time: 0.026912
TENSORRT LOG: --------------- Timing Runner: ActivationLeaky178 (PointWise)
TENSORRT LOG: Tactic: 128 Time: 0.08512
TENSORRT LOG: Tactic: 256 Time: 0.07168
TENSORRT LOG: Tactic: 512 Time: 0.068384
TENSORRT LOG: Fastest Tactic: 512 Time: 0.068384

Here is a piece of the TensorRT debug log from the cross-compiled build:

TENSORRT LOG: --------------- Timing Runner: PWN(ActivationLeaky178) (PointWiseV2)
TENSORRT LOG: PointWiseV2 has no valid tactics for this config, skipping
TENSORRT LOG: --------------- Timing Runner: PWN(ActivationLeaky178) (PointWise)
TENSORRT LOG: Tactic: 128 Time: 0.07616
TENSORRT LOG: Tactic: 256 Time: 0.076512
TENSORRT LOG: Tactic: 512 Time: 0.077792
TENSORRT LOG: Tactic: -32 Time: 0.10848
TENSORRT LOG: Tactic: -64 Time: 0.093728
TENSORRT LOG: Tactic: -128 Time: 0.089984
TENSORRT LOG: Fastest Tactic: 128 Time: 0.07616

You can clearly see that it is skipping PointWiseV2.

Now, aside from the build process of the executable and all the environment differences, can I somehow debug why it is skipping those tactics? I tried passing a lot of different flags while compiling, with no results.
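One low-effort check worth trying (a sketch, assuming the two binaries are named `my_app_native` and `my_app_cross`; both names are placeholders) is to compare which shared libraries each executable actually resolves at load time, since a missing or differently versioned CUDA/cuDNN library on one system can silently change which tactic sources TensorRT has available:

```shell
#!/bin/sh
# Compare the shared libraries resolved by the native and cross-compiled
# binaries. The executable names below are hypothetical placeholders.
ldd ./my_app_native | sort > native_libs.txt
ldd ./my_app_cross  | sort > cross_libs.txt

# Lines present in one list but not the other point at libraries that are
# missing or resolved to different versions/paths (e.g. libcudnn, libcublas).
diff native_libs.txt cross_libs.txt
```

Running the cross-compiled binary's `ldd` output on the Yocto target itself (rather than on the build host) matters here, because the target's dynamic linker configuration is what TensorRT sees at run time.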

Environment

TensorRT Version: v8.21
GPU Type: Jetson Xavier NX
Nvidia Driver Version: JetPack 4.6
CUDA Version: 10.2
CUDNN Version: v8.201
Operating System + Version: Yocto
The Jetson NX is set to the 20W 6-core power mode and the clock frequencies are at maximum.

Hi,

We request you to share the model, script, profiler, and performance output, if not shared already, so that we can help you better.

Alternatively, you can try running your model with the trtexec command.

While measuring model performance, make sure you consider the latency and throughput of the network inference, excluding the data pre- and post-processing overhead.
Please refer to the below links for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#measure-performance

https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#model-accuracy

Thanks!

I tried trtexec with this model:

I ran this command on the JetPack 4.6 system and on the Yocto system:

/usr/src/tensorrt/bin/trtexec --onnx=yolov4.onnx --fp16 --verbose

Attached you can find the complete log from each system:
jetpack_build.txt (5.2 MB)
yocto_build.txt (4.9 MB)

Here are the performance results:
JetPack 4.6:

[04/28/2022-22:48:58] [I] === Performance summary ===
[04/28/2022-22:48:58] [I] Throughput: 36.9607 qps
[04/28/2022-22:48:58] [I] Latency: min = 26.9713 ms, max = 27.135 ms, mean = 27.0446 ms, median = 27.0443 ms, percentile(99%) = 27.1266 ms
[04/28/2022-22:48:58] [I] End-to-End Host Latency: min = 26.9805 ms, max = 27.1497 ms, mean = 27.055 ms, median = 27.0519 ms, percentile(99%) = 27.1387 ms
[04/28/2022-22:48:58] [I] Enqueue Time: min = 4.22095 ms, max = 4.68652 ms, mean = 4.36693 ms, median = 4.33044 ms, percentile(99%) = 4.68213 ms
[04/28/2022-22:48:58] [I] H2D Latency: min = 0.0766602 ms, max = 0.079834 ms, mean = 0.0776475 ms, median = 0.0776367 ms, percentile(99%) = 0.0788574 ms
[04/28/2022-22:48:58] [I] GPU Compute Time: min = 26.7373 ms, max = 26.8997 ms, mean = 26.8112 ms, median = 26.8123 ms, percentile(99%) = 26.8936 ms
[04/28/2022-22:48:58] [I] D2H Latency: min = 0.139893 ms, max = 0.160583 ms, mean = 0.155763 ms, median = 0.155914 ms, percentile(99%) = 0.160156 ms
[04/28/2022-22:48:58] [I] Total Host Walltime: 3.03024 s
[04/28/2022-22:48:58] [I] Total GPU Compute Time: 3.00285 s
[04/28/2022-22:48:58] [I] Explanations of the performance metrics are printed in the verbose logs.

Yocto:

[04/28/2022-14:41:37] [I] === Performance summary ===
[04/28/2022-14:41:37] [I] Throughput: 21.8464 qps
[04/28/2022-14:41:37] [I] Latency: min = 45.6922 ms, max = 45.8469 ms, mean = 45.7634 ms, median = 45.7648 ms, percentile(99%) = 45.8469 ms
[04/28/2022-14:41:37] [I] End-to-End Host Latency: min = 45.7003 ms, max = 45.8577 ms, mean = 45.7735 ms, median = 45.7744 ms, percentile(99%) = 45.8577 ms
[04/28/2022-14:41:37] [I] Enqueue Time: min = 4.45667 ms, max = 5.1756 ms, mean = 4.70056 ms, median = 4.64478 ms, percentile(99%) = 5.1756 ms
[04/28/2022-14:41:37] [I] H2D Latency: min = 0.0773926 ms, max = 0.0800781 ms, mean = 0.0788337 ms, median = 0.0788574 ms, percentile(99%) = 0.0800781 ms
[04/28/2022-14:41:37] [I] GPU Compute Time: min = 45.4543 ms, max = 45.6099 ms, mean = 45.5267 ms, median = 45.5281 ms, percentile(99%) = 45.6099 ms
[04/28/2022-14:41:37] [I] D2H Latency: min = 0.141357 ms, max = 0.163391 ms, mean = 0.157818 ms, median = 0.157959 ms, percentile(99%) = 0.163391 ms
[04/28/2022-14:41:37] [I] Total Host Walltime: 3.06687 s
[04/28/2022-14:41:37] [I] Total GPU Compute Time: 3.05029 s
[04/28/2022-14:41:37] [I] Explanations of the performance metrics are printed in the verbose logs.

Looking at the logs, the build on Yocto is skipping the PointWiseV2 tactic.

Hi,

We are moving this post to the Jetson Xavier NX forum to get better help.

Thank you.

Hi,

Would you mind sharing more information about the 'cross-compiled' setup?
Do you generate the engine on another platform and copy the engine to the Xavier NX?

Or is the engine compiled on JetPack 4.6 and copied into the Yocto system for inference (both on the NX)?
If so, would you mind sharing the detailed software used in both environments with us?

Thanks.

Hi,
In the end I solved the issue: it turned out the Yocto build was missing some CUDA libraries.
Unfortunately, TensorRT was not complaining about this and simply ran with degraded performance.
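For anyone hitting the same symptom, a quick sanity check on the target (a sketch; the exact library names depend on the CUDA/cuDNN/TensorRT versions, so the list below is illustrative, not exhaustive) is to ask the dynamic linker which CUDA-related libraries it can actually find:

```shell
#!/bin/sh
# Verify that CUDA-related runtime libraries are visible to the dynamic
# linker on the target. The library names here are examples; adjust them
# to match your CUDA/cuDNN version.
for lib in libcudart.so libcudnn.so libcublas.so; do
    if ldconfig -p | grep -q "$lib"; then
        echo "found:   $lib"
    else
        echo "MISSING: $lib"
    fi
done
```

A `MISSING` line on the Yocto image but not on the JetPack image would explain TensorRT silently falling back to slower tactics.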

Thanks for the update.
Good to know the issue is solved.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.