trtexec fails to convert ONNX to a TensorRT engine (DLA core) in FP16, but INT8 works

Converting an ONNX model to a TensorRT engine on a DLA core with trtexec fails in FP16, but INT8 works. If I reduce the image resolution, the FP16 engine can also be built on the DLA core. The error is:

Hi,

We want to reproduce this issue internally.
Could you share the model and the command you used with us?

Thanks.

our.onnx (5.0 MB)
trtexec --onnx=our.onnx --useDLACore=0 --fp16 --allowGPUFallback
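
To compare the failing FP16 build with the working INT8 build, it can help to run both with identical flags apart from the precision and to keep a log of each. A minimal Python sketch, assuming trtexec is installed at /usr/src/tensorrt/bin/trtexec; the helper names build_trtexec_cmd and run_and_log and the log file name are hypothetical, not part of any NVIDIA tool:

```python
import subprocess

TRTEXEC = "/usr/src/tensorrt/bin/trtexec"  # assumed install path on JetPack


def build_trtexec_cmd(onnx_path, precision, dla_core=0, verbose=True):
    """Return the trtexec argument list for one precision ("fp16" or "int8")."""
    if precision not in ("fp16", "int8"):
        raise ValueError(f"unsupported precision: {precision}")
    cmd = [
        TRTEXEC,
        f"--onnx={onnx_path}",
        f"--useDLACore={dla_core}",
        f"--{precision}",
        "--allowGPUFallback",
    ]
    if verbose:
        cmd.append("--verbose")  # verbose logs carry more detail on the build
    return cmd


def run_and_log(cmd, log_path):
    """Run trtexec and write stdout and stderr to log_path (hypothetical helper)."""
    with open(log_path, "w") as log:
        return subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=False)


# Print the FP16 command from this post with --verbose added.
print(" ".join(build_trtexec_cmd("our.onnx", "fp16")))
```

On the Jetson itself, something like run_and_log(build_trtexec_cmd("our.onnx", "fp16"), "fp16.log") would capture the failing build for comparison with the INT8 one.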

Hi,

We can run your model with TensorRT 8.4 (JetPack 5.0.1 DP).
Could you give it a try?

$ /usr/src/tensorrt/bin/trtexec --onnx=our.onnx --useDLACore=0 --fp16 --allowGPUFallback
&&&& RUNNING TensorRT.trtexec [TensorRT v8400] # /usr/src/tensorrt/bin/trtexec --onnx=our.onnx --useDLACore=0 --fp16 --allowGPUFallback
[07/21/2022-04:02:42] [I] === Model Options ===
[07/21/2022-04:02:42] [I] Format: ONNX
[07/21/2022-04:02:42] [I] Model: our.onnx
[07/21/2022-04:02:42] [I] Output:
[07/21/2022-04:02:42] [I] === Build Options ===
[07/21/2022-04:02:42] [I] Max batch: explicit batch
[07/21/2022-04:02:42] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[07/21/2022-04:02:42] [I] minTiming: 1
[07/21/2022-04:02:42] [I] avgTiming: 8
[07/21/2022-04:02:42] [I] Precision: FP32+FP16
[07/21/2022-04:02:42] [I] LayerPrecisions:
[07/21/2022-04:02:42] [I] Calibration:
[07/21/2022-04:02:42] [I] Refit: Disabled
[07/21/2022-04:02:42] [I] Sparsity: Disabled
[07/21/2022-04:02:42] [I] Safe mode: Disabled
[07/21/2022-04:02:42] [I] DirectIO mode: Disabled
[07/21/2022-04:02:42] [I] Restricted mode: Disabled
[07/21/2022-04:02:42] [I] Build only: Disabled
[07/21/2022-04:02:42] [I] Save engine:
[07/21/2022-04:02:42] [I] Load engine:
[07/21/2022-04:02:42] [I] Profiling verbosity: 0
[07/21/2022-04:02:42] [I] Tactic sources: Using default tactic sources
[07/21/2022-04:02:42] [I] timingCacheMode: local
[07/21/2022-04:02:42] [I] timingCacheFile:
[07/21/2022-04:02:42] [I] Input(s)s format: fp32:CHW
[07/21/2022-04:02:42] [I] Output(s)s format: fp32:CHW
[07/21/2022-04:02:42] [I] Input build shapes: model
[07/21/2022-04:02:42] [I] Input calibration shapes: model
[07/21/2022-04:02:42] [I] === System Options ===
[07/21/2022-04:02:42] [I] Device: 0
[07/21/2022-04:02:42] [I] DLACore: 0(With GPU fallback)
[07/21/2022-04:02:42] [I] Plugins:
[07/21/2022-04:02:42] [I] === Inference Options ===
[07/21/2022-04:02:42] [I] Batch: Explicit
[07/21/2022-04:02:42] [I] Input inference shapes: model
[07/21/2022-04:02:42] [I] Iterations: 10
[07/21/2022-04:02:42] [I] Duration: 3s (+ 200ms warm up)
[07/21/2022-04:02:42] [I] Sleep time: 0ms
[07/21/2022-04:02:42] [I] Idle time: 0ms
[07/21/2022-04:02:42] [I] Streams: 1
[07/21/2022-04:02:42] [I] ExposeDMA: Disabled
[07/21/2022-04:02:42] [I] Data transfers: Enabled
[07/21/2022-04:02:42] [I] Spin-wait: Disabled
[07/21/2022-04:02:42] [I] Multithreading: Disabled
[07/21/2022-04:02:42] [I] CUDA Graph: Disabled
[07/21/2022-04:02:42] [I] Separate profiling: Disabled
[07/21/2022-04:02:42] [I] Time Deserialize: Disabled
[07/21/2022-04:02:42] [I] Time Refit: Disabled
[07/21/2022-04:02:42] [I] Inputs:
[07/21/2022-04:02:42] [I] === Reporting Options ===
[07/21/2022-04:02:42] [I] Verbose: Disabled
[07/21/2022-04:02:42] [I] Averages: 10 inferences
[07/21/2022-04:02:42] [I] Percentile: 99
[07/21/2022-04:02:42] [I] Dump refittable layers:Disabled
[07/21/2022-04:02:42] [I] Dump output: Disabled
[07/21/2022-04:02:42] [I] Profile: Disabled
[07/21/2022-04:02:42] [I] Export timing to JSON file:
[07/21/2022-04:02:42] [I] Export output to JSON file:
[07/21/2022-04:02:42] [I] Export profile to JSON file:
[07/21/2022-04:02:42] [I]
[07/21/2022-04:02:42] [I] === Device Information ===
[07/21/2022-04:02:42] [I] Selected Device: Xavier
[07/21/2022-04:02:42] [I] Compute Capability: 7.2
[07/21/2022-04:02:42] [I] SMs: 8
[07/21/2022-04:02:42] [I] Compute Clock Rate: 1.377 GHz
[07/21/2022-04:02:42] [I] Device Global Memory: 14907 MiB
[07/21/2022-04:02:42] [I] Shared Memory per SM: 96 KiB
[07/21/2022-04:02:42] [I] Memory Bus Width: 256 bits (ECC disabled)
[07/21/2022-04:02:42] [I] Memory Clock Rate: 1.377 GHz
[07/21/2022-04:02:42] [I]
[07/21/2022-04:02:42] [I] TensorRT version: 8.4.0
[07/21/2022-04:02:43] [I] [TRT] [MemUsageChange] Init CUDA: CPU +206, GPU +0, now: CPU 231, GPU 4658 (MiB)
[07/21/2022-04:02:45] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +141, GPU +132, now: CPU 391, GPU 4810 (MiB)
[07/21/2022-04:02:45] [I] Start parsing network model
[07/21/2022-04:02:45] [I] [TRT] ----------------------------------------------------------------
[07/21/2022-04:02:45] [I] [TRT] Input filename:   our.onnx
[07/21/2022-04:02:45] [I] [TRT] ONNX IR version:  0.0.6
[07/21/2022-04:02:45] [I] [TRT] Opset version:    9
[07/21/2022-04:02:45] [I] [TRT] Producer name:    pytorch
[07/21/2022-04:02:45] [I] [TRT] Producer version: 1.8
[07/21/2022-04:02:45] [I] [TRT] Domain:
[07/21/2022-04:02:45] [I] [TRT] Model version:    0
[07/21/2022-04:02:45] [I] [TRT] Doc string:
[07/21/2022-04:02:45] [I] [TRT] ----------------------------------------------------------------
[07/21/2022-04:02:45] [I] Finish parsing network model
[07/21/2022-04:02:48] [I] [TRT] ---------- Layers Running on DLA ----------
[07/21/2022-04:02:48] [I] [TRT] [DlaLayer] {ForeignNode[Conv_0...Conv_24]}
[07/21/2022-04:02:48] [I] [TRT] ---------- Layers Running on GPU ----------
[07/21/2022-04:02:49] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +260, GPU +213, now: CPU 657, GPU 5064 (MiB)
[07/21/2022-04:02:49] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +84, GPU +85, now: CPU 741, GPU 5149 (MiB)
[07/21/2022-04:02:49] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[07/21/2022-04:02:55] [I] [TRT] Detected 1 inputs and 2 output network tensors.
[07/21/2022-04:02:55] [I] [TRT] Total Host Persistent Memory: 864
[07/21/2022-04:02:55] [I] [TRT] Total Device Persistent Memory: 0
[07/21/2022-04:02:55] [I] [TRT] Total Scratch Memory: 0
[07/21/2022-04:02:55] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 5 MiB, GPU 19 MiB
[07/21/2022-04:02:55] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.015873ms to assign 1 blocks to 1 nodes requiring 614400 bytes.
[07/21/2022-04:02:55] [I] [TRT] Total Activation Memory: 614400
[07/21/2022-04:02:55] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +5, GPU +0, now: CPU 5, GPU 0 (MiB)
[07/21/2022-04:02:55] [I] Engine built in 12.5473 sec.
[07/21/2022-04:02:55] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 602, GPU 5163 (MiB)
[07/21/2022-04:02:55] [I] [TRT] Loaded engine size: 5 MiB
[07/21/2022-04:02:55] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +5, GPU +0, now: CPU 5, GPU 0 (MiB)
[07/21/2022-04:02:55] [I] Engine deserialized in 0.00262346 sec.
[07/21/2022-04:02:55] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 5, GPU 0 (MiB)
[07/21/2022-04:02:55] [I] Using random values for input input
[07/21/2022-04:02:55] [I] Created input binding for input with dimensions 1x1x480x640
[07/21/2022-04:02:55] [I] Using random values for output score
[07/21/2022-04:02:55] [I] Created output binding for score with dimensions 1x65x60x80
[07/21/2022-04:02:55] [I] Using random values for output desc
[07/21/2022-04:02:55] [I] Created output binding for desc with dimensions 1x256x60x80
[07/21/2022-04:02:55] [I] Starting inference
[07/21/2022-04:02:58] [I] Warmup completed 7 queries over 200 ms
[07/21/2022-04:02:58] [I] Timing trace has 106 queries over 3.08984 s
[07/21/2022-04:02:58] [I]
[07/21/2022-04:02:58] [I] === Trace details ===
[07/21/2022-04:02:58] [I] Trace averages of 10 runs:
[07/21/2022-04:02:58] [I] Average on 10 runs - GPU latency: 28.8791 ms - Host latency: 29.1076 ms (enqueue 28.7004 ms)
[07/21/2022-04:02:58] [I] Average on 10 runs - GPU latency: 28.8767 ms - Host latency: 29.1048 ms (enqueue 28.7048 ms)
[07/21/2022-04:02:58] [I] Average on 10 runs - GPU latency: 28.8862 ms - Host latency: 29.121 ms (enqueue 28.6726 ms)
[07/21/2022-04:02:58] [I] Average on 10 runs - GPU latency: 28.888 ms - Host latency: 29.1277 ms (enqueue 28.6472 ms)
[07/21/2022-04:02:58] [I] Average on 10 runs - GPU latency: 28.8832 ms - Host latency: 29.1243 ms (enqueue 28.6262 ms)
[07/21/2022-04:02:58] [I] Average on 10 runs - GPU latency: 28.8791 ms - Host latency: 29.1284 ms (enqueue 28.5892 ms)
[07/21/2022-04:02:58] [I] Average on 10 runs - GPU latency: 28.8692 ms - Host latency: 29.1119 ms (enqueue 28.6041 ms)
[07/21/2022-04:02:58] [I] Average on 10 runs - GPU latency: 28.8689 ms - Host latency: 29.1111 ms (enqueue 28.5894 ms)
[07/21/2022-04:02:58] [I] Average on 10 runs - GPU latency: 28.8745 ms - Host latency: 29.1201 ms (enqueue 28.5975 ms)
[07/21/2022-04:02:58] [I] Average on 10 runs - GPU latency: 28.8682 ms - Host latency: 29.1143 ms (enqueue 28.577 ms)
[07/21/2022-04:02:58] [I]
[07/21/2022-04:02:58] [I] === Performance summary ===
[07/21/2022-04:02:58] [I] Throughput: 34.306 qps
[07/21/2022-04:02:58] [I] Latency: min = 29.0828 ms, max = 29.1612 ms, mean = 29.1168 ms, median = 29.116 ms, percentile(99%) = 29.1599 ms
[07/21/2022-04:02:58] [I] Enqueue Time: min = 28.4871 ms, max = 28.7714 ms, mean = 28.6265 ms, median = 28.6235 ms, percentile(99%) = 28.7587 ms
[07/21/2022-04:02:58] [I] H2D Latency: min = 0.041153 ms, max = 0.081665 ms, mean = 0.0456055 ms, median = 0.0445557 ms, percentile(99%) = 0.0662842 ms
[07/21/2022-04:02:58] [I] GPU Compute Time: min = 28.8456 ms, max = 28.9236 ms, mean = 28.8767 ms, median = 28.8766 ms, percentile(99%) = 28.9207 ms
[07/21/2022-04:02:58] [I] D2H Latency: min = 0.164307 ms, max = 0.206299 ms, mean = 0.194463 ms, median = 0.196594 ms, percentile(99%) = 0.206055 ms
[07/21/2022-04:02:58] [I] Total Host Walltime: 3.08984 s
[07/21/2022-04:02:58] [I] Total GPU Compute Time: 3.06093 s
[07/21/2022-04:02:58] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[07/21/2022-04:02:58] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[07/21/2022-04:02:58] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/21/2022-04:02:58] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8400] # /usr/src/tensorrt/bin/trtexec --onnx=our.onnx --useDLACore=0 --fp16 --allowGPUFallback

Thanks.
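
In the log above, the whole network (Conv_0 through Conv_24) is placed on the DLA and the GPU section is empty, so nothing fell back to the GPU. A small Python sketch for pulling that placement out of a saved trtexec log; the [DlaLayer] tag is taken from this log, while the [GpuLayer] tag is an assumption that mirrors it and does not appear here:

```python
import re

# [DlaLayer] appears in the log above; [GpuLayer] is assumed by analogy.
LAYER_RE = re.compile(r"\[(Dla|Gpu)Layer\]\s*(.+)")


def layer_placement(log_text):
    """Group the layers named in a trtexec log by the device they run on."""
    placement = {"DLA": [], "GPU": []}
    for line in log_text.splitlines():
        match = LAYER_RE.search(line)
        if match:
            device = "DLA" if match.group(1) == "Dla" else "GPU"
            placement[device].append(match.group(2).strip())
    return placement


sample = """\
[07/21/2022-04:02:48] [I] [TRT] ---------- Layers Running on DLA ----------
[07/21/2022-04:02:48] [I] [TRT] [DlaLayer] {ForeignNode[Conv_0...Conv_24]}
[07/21/2022-04:02:48] [I] [TRT] ---------- Layers Running on GPU ----------
"""

print(layer_placement(sample))
# {'DLA': ['{ForeignNode[Conv_0...Conv_24]}'], 'GPU': []}
```

Running the same function over the logs from two TensorRT versions shows whether the layer placement differs between them.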

Okay, it does not run with TensorRT 8.2.1 (JetPack 4.6.1).

Hi,

The DLA version differs between the two releases.
The newer one may include a fix or added support that resolves this issue.

Thanks.
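
Since the only data points in this thread are that the model builds with TensorRT 8.4.0 and fails with 8.2.1, a quick check of which version produced a given log may be useful. A hedged Python sketch, assuming the log contains a "TensorRT version: X.Y.Z" line in the same form as the 8.4 log above; KNOWN_GOOD reflects only this thread's report and is not a documented minimum:

```python
import re

VERSION_RE = re.compile(r"TensorRT version:\s*(\d+)\.(\d+)\.(\d+)")

# Version reported in this thread to build the model; not an official threshold.
KNOWN_GOOD = (8, 4, 0)


def tensorrt_version(log_text):
    """Return (major, minor, patch) from a trtexec log, or None if absent."""
    match = VERSION_RE.search(log_text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


line = "[07/21/2022-04:02:42] [I] TensorRT version: 8.4.0"
version = tensorrt_version(line)
print(version, version >= KNOWN_GOOD)  # (8, 4, 0) True
```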

So is there no solution other than updating the version?
