Using trtexec to convert an ONNX model to a TensorRT engine on the DLA core fails with FP16, but INT8 works. When I reduce the image resolution, the FP16 TensorRT engine (DLA core) can also be built. The error is:
Hi,
We want to reproduce this issue internally.
Could you share the model and the command you used with us?
Thanks.
our.onnx (5.0 MB)
trtexec --onnx=our.onnx --useDLACore=0 --fp16 --allowGPUFallback
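For context, the builds described above as succeeding would look roughly like this (a sketch based on the description in the first post, not commands quoted from the thread; the INT8 build falls back to trtexec's random-data calibration unless a calibration cache is supplied via --calib, and our_small.onnx is a hypothetical re-export with smaller static input dimensions):
$ trtexec --onnx=our.onnx --useDLACore=0 --int8 --allowGPUFallback
$ trtexec --onnx=our_small.onnx --useDLACore=0 --fp16 --allowGPUFallback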
Hi,
We can run your model with TensorRT 8.4 (JetPack 5.0.1 DP).
Could you give it a try?
$ /usr/src/tensorrt/bin/trtexec --onnx=our.onnx --useDLACore=0 --fp16 --allowGPUFallback
&&&& RUNNING TensorRT.trtexec [TensorRT v8400] # /usr/src/tensorrt/bin/trtexec --onnx=our.onnx --useDLACore=0 --fp16 --allowGPUFallback
[07/21/2022-04:02:42] [I] === Model Options ===
[07/21/2022-04:02:42] [I] Format: ONNX
[07/21/2022-04:02:42] [I] Model: our.onnx
[07/21/2022-04:02:42] [I] Output:
[07/21/2022-04:02:42] [I] === Build Options ===
[07/21/2022-04:02:42] [I] Max batch: explicit batch
[07/21/2022-04:02:42] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[07/21/2022-04:02:42] [I] minTiming: 1
[07/21/2022-04:02:42] [I] avgTiming: 8
[07/21/2022-04:02:42] [I] Precision: FP32+FP16
[07/21/2022-04:02:42] [I] LayerPrecisions:
[07/21/2022-04:02:42] [I] Calibration:
[07/21/2022-04:02:42] [I] Refit: Disabled
[07/21/2022-04:02:42] [I] Sparsity: Disabled
[07/21/2022-04:02:42] [I] Safe mode: Disabled
[07/21/2022-04:02:42] [I] DirectIO mode: Disabled
[07/21/2022-04:02:42] [I] Restricted mode: Disabled
[07/21/2022-04:02:42] [I] Build only: Disabled
[07/21/2022-04:02:42] [I] Save engine:
[07/21/2022-04:02:42] [I] Load engine:
[07/21/2022-04:02:42] [I] Profiling verbosity: 0
[07/21/2022-04:02:42] [I] Tactic sources: Using default tactic sources
[07/21/2022-04:02:42] [I] timingCacheMode: local
[07/21/2022-04:02:42] [I] timingCacheFile:
[07/21/2022-04:02:42] [I] Input(s)s format: fp32:CHW
[07/21/2022-04:02:42] [I] Output(s)s format: fp32:CHW
[07/21/2022-04:02:42] [I] Input build shapes: model
[07/21/2022-04:02:42] [I] Input calibration shapes: model
[07/21/2022-04:02:42] [I] === System Options ===
[07/21/2022-04:02:42] [I] Device: 0
[07/21/2022-04:02:42] [I] DLACore: 0(With GPU fallback)
[07/21/2022-04:02:42] [I] Plugins:
[07/21/2022-04:02:42] [I] === Inference Options ===
[07/21/2022-04:02:42] [I] Batch: Explicit
[07/21/2022-04:02:42] [I] Input inference shapes: model
[07/21/2022-04:02:42] [I] Iterations: 10
[07/21/2022-04:02:42] [I] Duration: 3s (+ 200ms warm up)
[07/21/2022-04:02:42] [I] Sleep time: 0ms
[07/21/2022-04:02:42] [I] Idle time: 0ms
[07/21/2022-04:02:42] [I] Streams: 1
[07/21/2022-04:02:42] [I] ExposeDMA: Disabled
[07/21/2022-04:02:42] [I] Data transfers: Enabled
[07/21/2022-04:02:42] [I] Spin-wait: Disabled
[07/21/2022-04:02:42] [I] Multithreading: Disabled
[07/21/2022-04:02:42] [I] CUDA Graph: Disabled
[07/21/2022-04:02:42] [I] Separate profiling: Disabled
[07/21/2022-04:02:42] [I] Time Deserialize: Disabled
[07/21/2022-04:02:42] [I] Time Refit: Disabled
[07/21/2022-04:02:42] [I] Inputs:
[07/21/2022-04:02:42] [I] === Reporting Options ===
[07/21/2022-04:02:42] [I] Verbose: Disabled
[07/21/2022-04:02:42] [I] Averages: 10 inferences
[07/21/2022-04:02:42] [I] Percentile: 99
[07/21/2022-04:02:42] [I] Dump refittable layers:Disabled
[07/21/2022-04:02:42] [I] Dump output: Disabled
[07/21/2022-04:02:42] [I] Profile: Disabled
[07/21/2022-04:02:42] [I] Export timing to JSON file:
[07/21/2022-04:02:42] [I] Export output to JSON file:
[07/21/2022-04:02:42] [I] Export profile to JSON file:
[07/21/2022-04:02:42] [I]
[07/21/2022-04:02:42] [I] === Device Information ===
[07/21/2022-04:02:42] [I] Selected Device: Xavier
[07/21/2022-04:02:42] [I] Compute Capability: 7.2
[07/21/2022-04:02:42] [I] SMs: 8
[07/21/2022-04:02:42] [I] Compute Clock Rate: 1.377 GHz
[07/21/2022-04:02:42] [I] Device Global Memory: 14907 MiB
[07/21/2022-04:02:42] [I] Shared Memory per SM: 96 KiB
[07/21/2022-04:02:42] [I] Memory Bus Width: 256 bits (ECC disabled)
[07/21/2022-04:02:42] [I] Memory Clock Rate: 1.377 GHz
[07/21/2022-04:02:42] [I]
[07/21/2022-04:02:42] [I] TensorRT version: 8.4.0
[07/21/2022-04:02:43] [I] [TRT] [MemUsageChange] Init CUDA: CPU +206, GPU +0, now: CPU 231, GPU 4658 (MiB)
[07/21/2022-04:02:45] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +141, GPU +132, now: CPU 391, GPU 4810 (MiB)
[07/21/2022-04:02:45] [I] Start parsing network model
[07/21/2022-04:02:45] [I] [TRT] ----------------------------------------------------------------
[07/21/2022-04:02:45] [I] [TRT] Input filename: our.onnx
[07/21/2022-04:02:45] [I] [TRT] ONNX IR version: 0.0.6
[07/21/2022-04:02:45] [I] [TRT] Opset version: 9
[07/21/2022-04:02:45] [I] [TRT] Producer name: pytorch
[07/21/2022-04:02:45] [I] [TRT] Producer version: 1.8
[07/21/2022-04:02:45] [I] [TRT] Domain:
[07/21/2022-04:02:45] [I] [TRT] Model version: 0
[07/21/2022-04:02:45] [I] [TRT] Doc string:
[07/21/2022-04:02:45] [I] [TRT] ----------------------------------------------------------------
[07/21/2022-04:02:45] [I] Finish parsing network model
[07/21/2022-04:02:48] [I] [TRT] ---------- Layers Running on DLA ----------
[07/21/2022-04:02:48] [I] [TRT] [DlaLayer] {ForeignNode[Conv_0...Conv_24]}
[07/21/2022-04:02:48] [I] [TRT] ---------- Layers Running on GPU ----------
[07/21/2022-04:02:49] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +260, GPU +213, now: CPU 657, GPU 5064 (MiB)
[07/21/2022-04:02:49] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +84, GPU +85, now: CPU 741, GPU 5149 (MiB)
[07/21/2022-04:02:49] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[07/21/2022-04:02:55] [I] [TRT] Detected 1 inputs and 2 output network tensors.
[07/21/2022-04:02:55] [I] [TRT] Total Host Persistent Memory: 864
[07/21/2022-04:02:55] [I] [TRT] Total Device Persistent Memory: 0
[07/21/2022-04:02:55] [I] [TRT] Total Scratch Memory: 0
[07/21/2022-04:02:55] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 5 MiB, GPU 19 MiB
[07/21/2022-04:02:55] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.015873ms to assign 1 blocks to 1 nodes requiring 614400 bytes.
[07/21/2022-04:02:55] [I] [TRT] Total Activation Memory: 614400
[07/21/2022-04:02:55] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +5, GPU +0, now: CPU 5, GPU 0 (MiB)
[07/21/2022-04:02:55] [I] Engine built in 12.5473 sec.
[07/21/2022-04:02:55] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 602, GPU 5163 (MiB)
[07/21/2022-04:02:55] [I] [TRT] Loaded engine size: 5 MiB
[07/21/2022-04:02:55] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +5, GPU +0, now: CPU 5, GPU 0 (MiB)
[07/21/2022-04:02:55] [I] Engine deserialized in 0.00262346 sec.
[07/21/2022-04:02:55] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 5, GPU 0 (MiB)
[07/21/2022-04:02:55] [I] Using random values for input input
[07/21/2022-04:02:55] [I] Created input binding for input with dimensions 1x1x480x640
[07/21/2022-04:02:55] [I] Using random values for output score
[07/21/2022-04:02:55] [I] Created output binding for score with dimensions 1x65x60x80
[07/21/2022-04:02:55] [I] Using random values for output desc
[07/21/2022-04:02:55] [I] Created output binding for desc with dimensions 1x256x60x80
[07/21/2022-04:02:55] [I] Starting inference
[07/21/2022-04:02:58] [I] Warmup completed 7 queries over 200 ms
[07/21/2022-04:02:58] [I] Timing trace has 106 queries over 3.08984 s
[07/21/2022-04:02:58] [I]
[07/21/2022-04:02:58] [I] === Trace details ===
[07/21/2022-04:02:58] [I] Trace averages of 10 runs:
[07/21/2022-04:02:58] [I] Average on 10 runs - GPU latency: 28.8791 ms - Host latency: 29.1076 ms (enqueue 28.7004 ms)
[07/21/2022-04:02:58] [I] Average on 10 runs - GPU latency: 28.8767 ms - Host latency: 29.1048 ms (enqueue 28.7048 ms)
[07/21/2022-04:02:58] [I] Average on 10 runs - GPU latency: 28.8862 ms - Host latency: 29.121 ms (enqueue 28.6726 ms)
[07/21/2022-04:02:58] [I] Average on 10 runs - GPU latency: 28.888 ms - Host latency: 29.1277 ms (enqueue 28.6472 ms)
[07/21/2022-04:02:58] [I] Average on 10 runs - GPU latency: 28.8832 ms - Host latency: 29.1243 ms (enqueue 28.6262 ms)
[07/21/2022-04:02:58] [I] Average on 10 runs - GPU latency: 28.8791 ms - Host latency: 29.1284 ms (enqueue 28.5892 ms)
[07/21/2022-04:02:58] [I] Average on 10 runs - GPU latency: 28.8692 ms - Host latency: 29.1119 ms (enqueue 28.6041 ms)
[07/21/2022-04:02:58] [I] Average on 10 runs - GPU latency: 28.8689 ms - Host latency: 29.1111 ms (enqueue 28.5894 ms)
[07/21/2022-04:02:58] [I] Average on 10 runs - GPU latency: 28.8745 ms - Host latency: 29.1201 ms (enqueue 28.5975 ms)
[07/21/2022-04:02:58] [I] Average on 10 runs - GPU latency: 28.8682 ms - Host latency: 29.1143 ms (enqueue 28.577 ms)
[07/21/2022-04:02:58] [I]
[07/21/2022-04:02:58] [I] === Performance summary ===
[07/21/2022-04:02:58] [I] Throughput: 34.306 qps
[07/21/2022-04:02:58] [I] Latency: min = 29.0828 ms, max = 29.1612 ms, mean = 29.1168 ms, median = 29.116 ms, percentile(99%) = 29.1599 ms
[07/21/2022-04:02:58] [I] Enqueue Time: min = 28.4871 ms, max = 28.7714 ms, mean = 28.6265 ms, median = 28.6235 ms, percentile(99%) = 28.7587 ms
[07/21/2022-04:02:58] [I] H2D Latency: min = 0.041153 ms, max = 0.081665 ms, mean = 0.0456055 ms, median = 0.0445557 ms, percentile(99%) = 0.0662842 ms
[07/21/2022-04:02:58] [I] GPU Compute Time: min = 28.8456 ms, max = 28.9236 ms, mean = 28.8767 ms, median = 28.8766 ms, percentile(99%) = 28.9207 ms
[07/21/2022-04:02:58] [I] D2H Latency: min = 0.164307 ms, max = 0.206299 ms, mean = 0.194463 ms, median = 0.196594 ms, percentile(99%) = 0.206055 ms
[07/21/2022-04:02:58] [I] Total Host Walltime: 3.08984 s
[07/21/2022-04:02:58] [I] Total GPU Compute Time: 3.06093 s
[07/21/2022-04:02:58] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[07/21/2022-04:02:58] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[07/21/2022-04:02:58] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/21/2022-04:02:58] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8400] # /usr/src/tensorrt/bin/trtexec --onnx=our.onnx --useDLACore=0 --fp16 --allowGPUFallback
Thanks.
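Before retrying, it may help to confirm which TensorRT and JetPack releases are actually installed (standard Jetson commands, shown here for reference rather than taken from this thread):
$ dpkg -l | grep nvinfer
$ cat /etc/nv_tegra_release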
Okay, but it cannot run with TensorRT 8.2.1 (JetPack 4.6.1).
Hi,
The DLA software version differs between the two releases.
So TensorRT 8.4 likely contains a fix or added support that resolves this issue.
Thanks.
So is there no solution other than upgrading the version?
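If upgrading is the only path forward, the working command from the log can also serialize the engine so the build cost is paid once (the --saveEngine flag is standard trtexec; the output file name here is illustrative):
$ trtexec --onnx=our.onnx --useDLACore=0 --fp16 --allowGPUFallback --saveEngine=our_dla_fp16.engine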