TensorRT optimization causes performance degradation compared to ONNX model

Description

I have a performance degradation issue when converting an ONNX model to TensorRT.

Environment

TensorRT Version: 6.0
GPU Type: NVIDIA Jetson Nano
Nvidia Driver Version: JetPack 4.3
CUDA Version: 10.2
CUDNN Version:
Operating System + Version: JetPack 4.3

I have some experience with ONNX <-> TensorRT model conversion, and I am running into a performance degradation when converting one specific ONNX model to TensorRT.
In detail, I have an action recognition model based on an ST-GCN graph, and when I converted the ONNX ST-GCN model to TensorRT, performance degraded.

Inference takes 0.08 ms with the ONNX model but 300 ms with the TensorRT model (both measured on the Jetson Nano).

Also, when I benchmarked two different Torch models (MobileNetV3 and the action recognition model above) on an RTX 3090, each took about 1 ms per inference. But after bringing them to the Jetson Nano, the TensorRT MobileNetV3 inference time rose to 6 ms, while my action recognition TensorRT model took 300 ms.
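
(For reference, a minimal sketch of how such per-inference timings can be measured with onnxruntime; the file name and provider below are placeholders, not my exact script. The input shape matches my engine's input binding, 1x3x50x15x1.)

import time
import numpy as np
import onnxruntime as ort

# Placeholder model path and provider; pick the provider you actually use
sess = ort.InferenceSession("your_model.onnx", providers=["CPUExecutionProvider"])
feed = {sess.get_inputs()[0].name: np.random.rand(1, 3, 50, 15, 1).astype(np.float32)}

sess.run(None, feed)                      # warm-up run
t0 = time.perf_counter()
for _ in range(100):
    sess.run(None, feed)
print((time.perf_counter() - t0) / 100 * 1000, "ms per inference")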

To sum up:

  • I want to know the main cause of the big speed difference between the two TRT models (the two models' inference speeds on the RTX 3090 are the same).
  • Is there any known speed degradation when converting a model to TRT?

P.S. My suspicion about the slowdown: my action recognition model has an input with shape (N, C, T, V, M), and TensorRT may not be optimizing this 5-dimensional input shape… (see the sketch below).
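
(If the 5-D input really is the issue, one workaround I could try is folding the person dimension M into the batch before ONNX export, so TensorRT only ever sees a 4-D tensor. A hedged sketch, assuming a PyTorch model and the usual (N, C, T, V, M) layout; not verified on my model:)

import torch

x = torch.randn(1, 3, 50, 15, 1)                        # (N, C, T, V, M), as in my input binding
n, c, t, v, m = x.shape
# Move M next to N and merge them: (N, C, T, V, M) -> (N*M, C, T, V)
x4d = x.permute(0, 4, 1, 2, 3).reshape(n * m, c, t, v)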

Hi,
Request you to share the ONNX model and the script, if not shared already, so that we can assist you better.
Alongside, you can try a few things:

  1. Validating your model with the snippet below:

check_model.py

import onnx

filename = "your_model.onnx"          # path to your ONNX model
model = onnx.load(filename)
onnx.checker.check_model(model)       # raises an exception if the model is invalid
  2. Trying to run your model with the trtexec command (see the example below):
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
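For example (the ONNX file name is a placeholder):

trtexec --onnx=your_model.onnx --verbose --dumpProfile --exportTimes=trace.json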
In case you are still facing the issue, request you to share the trtexec "--verbose" log for further debugging.
Thanks!

Thanks for answering, @NVES!

My output is below.

1. check_model.py

My ONNX model passes onnx.checker.check_model(model) without errors.

2. trtexec performance

This is my trtexec performance log, run with the --verbose option:

_ver4_light_SGD.engine --warmUp=100 --verbose --exportTimes=trace.json
[01/26/2022-20:20:51] [I] === Model Options ===
[01/26/2022-20:20:51] [I] Format: *
[01/26/2022-20:20:51] [I] Model: 
[01/26/2022-20:20:51] [I] Output:
[01/26/2022-20:20:51] [I] === Build Options ===
[01/26/2022-20:20:51] [I] Max batch: 1
[01/26/2022-20:20:51] [I] Workspace: 16 MiB
[01/26/2022-20:20:51] [I] minTiming: 1
[01/26/2022-20:20:51] [I] avgTiming: 8
[01/26/2022-20:20:51] [I] Precision: FP32
[01/26/2022-20:20:51] [I] Calibration: 
[01/26/2022-20:20:51] [I] Refit: Disabled
[01/26/2022-20:20:51] [I] Sparsity: Disabled
[01/26/2022-20:20:51] [I] Safe mode: Disabled
[01/26/2022-20:20:51] [I] Restricted mode: Disabled
[01/26/2022-20:20:51] [I] Save engine: 
[01/26/2022-20:20:51] [I] Load engine: /home/butlely/Desktop/sangin/butlely/test_file/model_action_recognition_final_ver4_light_SGD.engine
[01/26/2022-20:20:51] [I] NVTX verbosity: 0
[01/26/2022-20:20:51] [I] Tactic sources: Using default tactic sources
[01/26/2022-20:20:51] [I] timingCacheMode: local
[01/26/2022-20:20:51] [I] timingCacheFile: 
[01/26/2022-20:20:51] [I] Input(s)s format: fp32:CHW
[01/26/2022-20:20:51] [I] Output(s)s format: fp32:CHW
[01/26/2022-20:20:51] [I] Input build shapes: model
[01/26/2022-20:20:51] [I] Input calibration shapes: model
[01/26/2022-20:20:51] [I] === System Options ===
[01/26/2022-20:20:51] [I] Device: 0
[01/26/2022-20:20:51] [I] DLACore: 
[01/26/2022-20:20:51] [I] Plugins:
[01/26/2022-20:20:51] [I] === Inference Options ===
[01/26/2022-20:20:51] [I] Batch: 1
[01/26/2022-20:20:51] [I] Input inference shapes: model
[01/26/2022-20:20:51] [I] Iterations: 10
[01/26/2022-20:20:51] [I] Duration: 3s (+ 100ms warm up)
[01/26/2022-20:20:51] [I] Sleep time: 0ms
[01/26/2022-20:20:51] [I] Streams: 1
[01/26/2022-20:20:51] [I] ExposeDMA: Disabled
[01/26/2022-20:20:51] [I] Data transfers: Enabled
[01/26/2022-20:20:51] [I] Spin-wait: Disabled
[01/26/2022-20:20:51] [I] Multithreading: Disabled
[01/26/2022-20:20:51] [I] CUDA Graph: Disabled
[01/26/2022-20:20:51] [I] Separate profiling: Disabled
[01/26/2022-20:20:51] [I] Time Deserialize: Disabled
[01/26/2022-20:20:51] [I] Time Refit: Disabled
[01/26/2022-20:20:51] [I] Skip inference: Disabled
[01/26/2022-20:20:51] [I] Inputs:
[01/26/2022-20:20:51] [I] === Reporting Options ===
[01/26/2022-20:20:51] [I] Verbose: Enabled
[01/26/2022-20:20:51] [I] Averages: 10 inferences
[01/26/2022-20:20:51] [I] Percentile: 99
[01/26/2022-20:20:51] [I] Dump refittable layers:Disabled
[01/26/2022-20:20:51] [I] Dump output: Disabled
[01/26/2022-20:20:51] [I] Profile: Disabled
[01/26/2022-20:20:51] [I] Export timing to JSON file: trace.json
[01/26/2022-20:20:51] [I] Export output to JSON file: 
[01/26/2022-20:20:51] [I] Export profile to JSON file: 
[01/26/2022-20:20:51] [I] 
[01/26/2022-20:20:51] [I] === Device Information ===
[01/26/2022-20:20:51] [I] Selected Device: NVIDIA Tegra X1
[01/26/2022-20:20:51] [I] Compute Capability: 5.3
[01/26/2022-20:20:51] [I] SMs: 1
[01/26/2022-20:20:51] [I] Compute Clock Rate: 0.9216 GHz
[01/26/2022-20:20:51] [I] Device Global Memory: 3964 MiB
[01/26/2022-20:20:51] [I] Shared Memory per SM: 64 KiB
[01/26/2022-20:20:51] [I] Memory Bus Width: 64 bits (ECC disabled)
[01/26/2022-20:20:51] [I] Memory Clock Rate: 0.01275 GHz
[01/26/2022-20:20:51] [I] 
[01/26/2022-20:20:51] [I] TensorRT version: 8001
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::Clip_TRT version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::ScatterND version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::EfficientNMS_TRT version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::Proposal version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::Split version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[01/26/2022-20:20:51] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[01/26/2022-20:20:54] [I] [TRT] [MemUsageChange] Init CUDA: CPU +203, GPU +0, now: CPU 228, GPU 3011 (MiB)
[01/26/2022-20:20:54] [I] [TRT] Loaded engine size: 6 MB
[01/26/2022-20:20:54] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 228 MiB, GPU 3011 MiB
[01/26/2022-20:20:57] [V] [TRT] Using cublas a tactic source
[01/26/2022-20:20:57] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +158, GPU +244, now: CPU 386, GPU 3255 (MiB)
[01/26/2022-20:20:57] [V] [TRT] Using cuDNN as a tactic source
[01/26/2022-20:21:00] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +241, GPU +356, now: CPU 627, GPU 3611 (MiB)
[01/26/2022-20:21:00] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 626, GPU 3611 (MiB)
[01/26/2022-20:21:00] [V] [TRT] Deserialization required 5887655 microseconds.
[01/26/2022-20:21:00] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 626 MiB, GPU 3611 MiB
[01/26/2022-20:21:00] [I] Engine loaded in 9.31093 sec.
[01/26/2022-20:21:00] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 619 MiB, GPU 3604 MiB
[01/26/2022-20:21:00] [V] [TRT] Using cublas a tactic source
[01/26/2022-20:21:00] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +0, now: CPU 620, GPU 3604 (MiB)
[01/26/2022-20:21:00] [V] [TRT] Using cuDNN as a tactic source
[01/26/2022-20:21:00] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 620, GPU 3604 (MiB)
[01/26/2022-20:21:00] [V] [TRT] Total per-runner device memory is 7148032
[01/26/2022-20:21:00] [V] [TRT] Total per-runner host memory is 27728
[01/26/2022-20:21:00] [V] [TRT] Allocated activation device memory of size 23623680
[01/26/2022-20:21:01] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 622 MiB, GPU 3639 MiB
[01/26/2022-20:21:01] [I] Created input binding for 0 with dimensions 1x3x50x15x1
[01/26/2022-20:21:01] [I] Created output binding for 288 with dimensions 1
[01/26/2022-20:21:01] [I] Starting inference
[01/26/2022-20:21:08] [I] Warmup completed 1 queries over 100 ms
[01/26/2022-20:21:08] [I] Timing trace has 10 queries over 3.19113 s
[01/26/2022-20:21:08] [I] 
[01/26/2022-20:21:08] [I] === Trace details ===
[01/26/2022-20:21:08] [I] Trace averages of 10 runs:
[01/26/2022-20:21:08] [I] Average on 10 runs - GPU latency: 319.087 ms - Host latency: 319.099 ms (end to end 319.112 ms, enqueue 1.78638 ms)
[01/26/2022-20:21:08] [I] 
[01/26/2022-20:21:08] [I] === Performance summary ===
[01/26/2022-20:21:08] [I] Throughput: 3.13369 qps
[01/26/2022-20:21:08] [I] Latency: min = 316.598 ms, max = 320.745 ms, mean = 319.099 ms, median = 319.11 ms, percentile(99%) = 320.745 ms
[01/26/2022-20:21:08] [I] End-to-End Host Latency: min = 316.61 ms, max = 320.757 ms, mean = 319.112 ms, median = 319.119 ms, percentile(99%) = 320.757 ms
[01/26/2022-20:21:08] [I] Enqueue Time: min = 1.41821 ms, max = 1.94971 ms, mean = 1.78638 ms, median = 1.80994 ms, percentile(99%) = 1.94971 ms
[01/26/2022-20:21:08] [I] H2D Latency: min = 0.00195312 ms, max = 0.0646973 ms, mean = 0.00852051 ms, median = 0.00244141 ms, percentile(99%) = 0.0646973 ms
[01/26/2022-20:21:08] [I] GPU Compute Time: min = 316.592 ms, max = 320.739 ms, mean = 319.087 ms, median = 319.105 ms, percentile(99%) = 320.739 ms
[01/26/2022-20:21:08] [I] D2H Latency: min = 0.00146484 ms, max = 0.00439453 ms, mean = 0.0034668 ms, median = 0.00341797 ms, percentile(99%) = 0.00439453 ms
[01/26/2022-20:21:08] [I] Total Host Walltime: 3.19113 s
[01/26/2022-20:21:08] [I] Total GPU Compute Time: 3.19087 s
[01/26/2022-20:21:08] [I] Explanations of the performance metrics are printed in the verbose logs.
[01/26/2022-20:21:08] [V] 
[01/26/2022-20:21:08] [V] === Explanations of the performance metrics ===
[01/26/2022-20:21:08] [V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
[01/26/2022-20:21:08] [V] GPU Compute Time: the GPU latency to execute the kernels for a query.
[01/26/2022-20:21:08] [V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
[01/26/2022-20:21:08] [V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
[01/26/2022-20:21:08] [V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
[01/26/2022-20:21:08] [V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
[01/26/2022-20:21:08] [V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
[01/26/2022-20:21:08] [V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
[01/26/2022-20:21:08] [V] End-to-End Host Latency: the duration from when the H2D of a query is called to when the D2H of the same query is completed, which includes the latency to wait for the completion of the previous query. This is the latency of a query if multiple queries are enqueued consecutively.
[01/26/2022-20:21:08] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8001]

The speed with FP32 and FP16 precision makes no difference, so I am uploading the log from the better-performing FP32 TRT engine.
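(For reference, an FP16 engine can be built with the standard trtexec flag; the file names here are placeholders:)

trtexec --onnx=your_model.onnx --fp16 --saveEngine=model_fp16.engine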

[01/26/2022-20:32:09] [I] === Profile (11 iterations ) ===
[01/26/2022-20:32:09] [I]                                                   Layer   Time (ms)   Avg. Time (ms)   Time %
[01/26/2022-20:32:09] [I]  Transpose_0 + Reshape_1 + (Unnamed Layer* 2) [Shuffle]        1.67           0.1517      0.0
[01/26/2022-20:32:09] [I]                                    BatchNormalization_2        1.35           0.1225      0.0
[01/26/2022-20:32:09] [I]  (Unnamed Layer* 4) [Shuffle] + Reshape_3 + Transpose_4        1.39           0.1263      0.0
[01/26/2022-20:32:09] [I]                                                  Conv_6        2.45           0.2230      0.0
[01/26/2022-20:32:09] [I]                                                     334        0.07           0.0061      0.0
[01/26/2022-20:32:09] [I]                                                   Mul_9       67.77           6.1605      1.3
[01/26/2022-20:32:09] [I]                                            ReduceSum_10      505.78          45.9799      9.7
[01/26/2022-20:32:09] [I]                                            ReduceSum_11       59.21           5.3826      1.1
[01/26/2022-20:32:09] [I]                         BatchNormalization_12 + Relu_13        0.49           0.0447      0.0
[01/26/2022-20:32:09] [I]                                       Conv_14 + Relu_17        4.04           0.3669      0.1
[01/26/2022-20:32:09] [I]                                                 Conv_18        1.90           0.1727      0.0
[01/26/2022-20:32:09] [I]                                                     344        0.01           0.0008      0.0
[01/26/2022-20:32:09] [I]                                                  Mul_21       37.82           3.4382      0.7
[01/26/2022-20:32:09] [I]                                            ReduceSum_22      429.97          39.0886      8.2
[01/26/2022-20:32:09] [I]                                            ReduceSum_23       58.69           5.3359      1.1
[01/26/2022-20:32:09] [I]                         BatchNormalization_24 + Relu_25        0.47           0.0432      0.0
[01/26/2022-20:32:09] [I]                              Conv_26 + Add_27 + Relu_28        3.99           0.3626      0.1
[01/26/2022-20:32:09] [I]                                                 Conv_30        3.83           0.3483      0.1
[01/26/2022-20:32:09] [I]                                                     354        0.02           0.0015      0.0
[01/26/2022-20:32:09] [I]                                                  Mul_33       75.59           6.8716      1.4
[01/26/2022-20:32:09] [I]                                            ReduceSum_34      853.64          77.6032     16.4
[01/26/2022-20:32:09] [I]                                            ReduceSum_35      108.72           9.8837      2.1
[01/26/2022-20:32:09] [I]                         BatchNormalization_36 + Relu_37        0.82           0.0741      0.0
[01/26/2022-20:32:09] [I]                                                 Conv_38        7.04           0.6398      0.1
[01/26/2022-20:32:09] [I]                              Conv_29 + Add_39 + Relu_40        0.77           0.0703      0.0
[01/26/2022-20:32:09] [I]                                                 Conv_42        5.51           0.5013      0.1
[01/26/2022-20:32:09] [I]                                                     364        0.02           0.0016      0.0
[01/26/2022-20:32:09] [I]                                                  Mul_45       69.12           6.2838      1.3
[01/26/2022-20:32:09] [I]                                            ReduceSum_46      816.47          74.2249     15.6
[01/26/2022-20:32:09] [I]                                            ReduceSum_47      106.64           9.6948      2.0
[01/26/2022-20:32:09] [I]                         BatchNormalization_48 + Relu_49        0.81           0.0738      0.0
[01/26/2022-20:32:09] [I]                                                 Conv_50       16.51           1.5008      0.3
[01/26/2022-20:32:09] [I]                              Conv_41 + Add_51 + Relu_52        1.40           0.1275      0.0
[01/26/2022-20:32:09] [I]                                                 Conv_53        6.14           0.5584      0.1
[01/26/2022-20:32:09] [I]                                                     374        0.02           0.0014      0.0
[01/26/2022-20:32:09] [I]                                                  Mul_56       34.91           3.1740      0.7
[01/26/2022-20:32:09] [I]                                            ReduceSum_57      419.47          38.1333      8.0
[01/26/2022-20:32:09] [I]                                            ReduceSum_58       56.55           5.1409      1.1
[01/26/2022-20:32:09] [I]                         BatchNormalization_59 + Relu_60        0.49           0.0446      0.0
[01/26/2022-20:32:09] [I]                              Conv_61 + Add_62 + Relu_63       16.56           1.5055      0.3
[01/26/2022-20:32:09] [I]                                          AveragePool_66        0.32           0.0289      0.0
[01/26/2022-20:32:09] [I]                                           ReduceMean_68        0.23           0.0208      0.0
[01/26/2022-20:32:09] [I]                                                 Conv_69     1440.25         130.9317     27.6
[01/26/2022-20:32:09] [I]                                               ArgMax_71        0.22           0.0199      0.0
[01/26/2022-20:32:09] [I]                                                   Total     5219.13         474.4666    100.0

I used the trtexec --dumpProfile option to check per-layer speed, and found that the ReduceSum layers (together with the final Conv_69) take up most of the time in the TRT engine; see the sketch below for where those layers likely come from.
Do you have any suggestions to solve this problem?
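
(For context: these Mul + ReduceSum pairs are what the ONNX exporter typically produces for the einsum in ST-GCN's graph convolution. One rewrite I could try before export is an equivalent batched matmul, which TensorRT generally maps to GEMM kernels. A hedged sketch with illustrative shapes, following the standard ST-GCN formulation rather than my exact model:)

import torch

N, K, C, T, V = 1, 3, 64, 50, 15             # illustrative sizes only
x = torch.randn(N, K, C, T, V)               # features split over K adjacency subsets
A = torch.randn(K, V, V)                     # K adjacency matrices

# The usual ST-GCN form; ONNX export decomposes this into Mul + ReduceSum
y_einsum = torch.einsum('nkctv,kvw->nctw', x, A)

# Equivalent via broadcasted matmul: (T,V) @ (V,V) GEMMs, then sum over the K subsets
y_matmul = torch.matmul(x, A.view(1, K, 1, V, V)).sum(dim=1)

print(torch.allclose(y_einsum, y_matmul, atol=1e-5))   # True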