Hi,
Which JetPack version do you use?
It seems you have already updated TensorRT to 8.2.1.
We tested the model with TensorRT v8.0.1, which is included in JetPack 4.6.
It runs without issue in DLA mode:
$ /usr/src/tensorrt/bin/trtexec --onnx=./fake_quantized_detnet.onnx --int8 --useDLACore=0 --allowGPUFallback
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=./fake_quantized_detnet.onnx --int8 --useDLACore=0 --allowGPUFallback
[03/08/2022-00:33:53] [I] === Model Options ===
[03/08/2022-00:33:53] [I] Format: ONNX
[03/08/2022-00:33:53] [I] Model: ./fake_quantized_detnet.onnx
[03/08/2022-00:33:53] [I] Output:
[03/08/2022-00:33:53] [I] === Build Options ===
[03/08/2022-00:33:53] [I] Max batch: explicit
[03/08/2022-00:33:53] [I] Workspace: 16 MiB
[03/08/2022-00:33:53] [I] minTiming: 1
[03/08/2022-00:33:53] [I] avgTiming: 8
[03/08/2022-00:33:53] [I] Precision: FP32+INT8
[03/08/2022-00:33:53] [I] Calibration: Dynamic
[03/08/2022-00:33:53] [I] Refit: Disabled
[03/08/2022-00:33:53] [I] Sparsity: Disabled
[03/08/2022-00:33:53] [I] Safe mode: Disabled
[03/08/2022-00:33:53] [I] Restricted mode: Disabled
[03/08/2022-00:33:53] [I] Save engine:
[03/08/2022-00:33:53] [I] Load engine:
[03/08/2022-00:33:53] [I] NVTX verbosity: 0
[03/08/2022-00:33:53] [I] Tactic sources: Using default tactic sources
[03/08/2022-00:33:53] [I] timingCacheMode: local
[03/08/2022-00:33:53] [I] timingCacheFile:
[03/08/2022-00:33:53] [I] Input(s)s format: fp32:CHW
[03/08/2022-00:33:53] [I] Output(s)s format: fp32:CHW
[03/08/2022-00:33:53] [I] Input build shapes: model
[03/08/2022-00:33:53] [I] Input calibration shapes: model
[03/08/2022-00:33:53] [I] === System Options ===
[03/08/2022-00:33:53] [I] Device: 0
[03/08/2022-00:33:53] [I] DLACore: 0(With GPU fallback)
[03/08/2022-00:33:53] [I] Plugins:
[03/08/2022-00:33:53] [I] === Inference Options ===
[03/08/2022-00:33:53] [I] Batch: Explicit
[03/08/2022-00:33:53] [I] Input inference shapes: model
[03/08/2022-00:33:53] [I] Iterations: 10
[03/08/2022-00:33:53] [I] Duration: 3s (+ 200ms warm up)
[03/08/2022-00:33:53] [I] Sleep time: 0ms
[03/08/2022-00:33:53] [I] Streams: 1
[03/08/2022-00:33:53] [I] ExposeDMA: Disabled
[03/08/2022-00:33:53] [I] Data transfers: Enabled
[03/08/2022-00:33:53] [I] Spin-wait: Disabled
[03/08/2022-00:33:53] [I] Multithreading: Disabled
[03/08/2022-00:33:53] [I] CUDA Graph: Disabled
[03/08/2022-00:33:53] [I] Separate profiling: Disabled
[03/08/2022-00:33:53] [I] Time Deserialize: Disabled
[03/08/2022-00:33:53] [I] Time Refit: Disabled
[03/08/2022-00:33:53] [I] Skip inference: Disabled
[03/08/2022-00:33:53] [I] Inputs:
[03/08/2022-00:33:53] [I] === Reporting Options ===
[03/08/2022-00:33:53] [I] Verbose: Disabled
[03/08/2022-00:33:53] [I] Averages: 10 inferences
[03/08/2022-00:33:53] [I] Percentile: 99
[03/08/2022-00:33:53] [I] Dump refittable layers:Disabled
[03/08/2022-00:33:53] [I] Dump output: Disabled
[03/08/2022-00:33:53] [I] Profile: Disabled
[03/08/2022-00:33:53] [I] Export timing to JSON file:
[03/08/2022-00:33:53] [I] Export output to JSON file:
[03/08/2022-00:33:53] [I] Export profile to JSON file:
[03/08/2022-00:33:53] [I]
[03/08/2022-00:33:53] [I] === Device Information ===
[03/08/2022-00:33:53] [I] Selected Device: Xavier
[03/08/2022-00:33:53] [I] Compute Capability: 7.2
[03/08/2022-00:33:53] [I] SMs: 8
[03/08/2022-00:33:53] [I] Compute Clock Rate: 1.377 GHz
[03/08/2022-00:33:53] [I] Device Global Memory: 31920 MiB
[03/08/2022-00:33:53] [I] Shared Memory per SM: 96 KiB
[03/08/2022-00:33:53] [I] Memory Bus Width: 256 bits (ECC disabled)
[03/08/2022-00:33:53] [I] Memory Clock Rate: 1.377 GHz
[03/08/2022-00:33:53] [I]
[03/08/2022-00:33:53] [I] TensorRT version: 8001
[03/08/2022-00:33:54] [I] [TRT] [MemUsageChange] Init CUDA: CPU +354, GPU +0, now: CPU 372, GPU 9758 (MiB)
[03/08/2022-00:33:54] [I] Start parsing network model
[03/08/2022-00:33:54] [I] [TRT] ----------------------------------------------------------------
[03/08/2022-00:33:54] [I] [TRT] Input filename: ./fake_quantized_detnet.onnx
[03/08/2022-00:33:54] [I] [TRT] ONNX IR version: 0.0.6
[03/08/2022-00:33:54] [I] [TRT] Opset version: 13
[03/08/2022-00:33:54] [I] [TRT] Producer name: pytorch
[03/08/2022-00:33:54] [I] [TRT] Producer version: 1.9
[03/08/2022-00:33:54] [I] [TRT] Domain:
[03/08/2022-00:33:54] [I] [TRT] Model version: 0
[03/08/2022-00:33:54] [I] [TRT] Doc string:
[03/08/2022-00:33:54] [I] [TRT] ----------------------------------------------------------------
[03/08/2022-00:33:54] [W] [TRT] onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[03/08/2022-00:33:54] [I] Finish parsing network model
[03/08/2022-00:33:54] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 390, GPU 9810 (MiB)
[03/08/2022-00:33:54] [I] FP32 and INT8 precisions have been specified - more performance might be enabled by additionally specifying --fp16 or --best
...
[03/08/2022-00:36:00] [I] [TRT] Total Host Persistent Memory: 52096
[03/08/2022-00:36:00] [I] [TRT] Total Device Persistent Memory: 17249280
[03/08/2022-00:36:00] [I] [TRT] Total Scratch Memory: 0
[03/08/2022-00:36:00] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 20 MiB, GPU 613 MiB
[03/08/2022-00:36:00] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1399, GPU 11291 (MiB)
[03/08/2022-00:36:00] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1399, GPU 11291 (MiB)
[03/08/2022-00:36:00] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1399, GPU 11291 (MiB)
[03/08/2022-00:36:00] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1398, GPU 11291 (MiB)
[03/08/2022-00:36:00] [I] [TRT] [MemUsageSnapshot] Builder end: CPU 1383 MiB, GPU 11291 MiB
[03/08/2022-00:36:00] [I] [TRT] Loaded engine size: 17 MB
[03/08/2022-00:36:00] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 1411 MiB, GPU 11304 MiB
[03/08/2022-00:36:00] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1411, GPU 11304 (MiB)
[03/08/2022-00:36:00] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1411, GPU 11304 (MiB)
[03/08/2022-00:36:00] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1411, GPU 11304 (MiB)
[03/08/2022-00:36:00] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 1411 MiB, GPU 11304 MiB
[03/08/2022-00:36:00] [I] Engine built in 127.321 sec.
[03/08/2022-00:36:00] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 1393 MiB, GPU 11287 MiB
[03/08/2022-00:36:00] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1393, GPU 11287 (MiB)
[03/08/2022-00:36:00] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1393, GPU 11287 (MiB)
[03/08/2022-00:36:00] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 1393 MiB, GPU 11288 MiB
[03/08/2022-00:36:01] [I] Created input binding for xyz with dimensions 1x9x960x400
[03/08/2022-00:36:01] [I] Created input binding for radar with dimensions 1x5x960x400
[03/08/2022-00:36:01] [I] Created input binding for rgb with dimensions 1x10x960x400
[03/08/2022-00:36:01] [I] Created output binding for conv5 with dimensions 1x96000x7x2x21
[03/08/2022-00:36:01] [I] Starting inference
[03/08/2022-00:36:04] [I] Warmup completed 4 queries over 200 ms
[03/08/2022-00:36:04] [I] Timing trace has 48 queries over 3.08892 s
[03/08/2022-00:36:04] [I]
[03/08/2022-00:36:04] [I] === Trace details ===
[03/08/2022-00:36:04] [I] Trace averages of 10 runs:
[03/08/2022-00:36:04] [I] Average on 10 runs - GPU latency: 59.6344 ms - Host latency: 64.1485 ms (end to end 64.1578 ms, enqueue 1.49861 ms)
[03/08/2022-00:36:04] [I] Average on 10 runs - GPU latency: 59.9732 ms - Host latency: 64.5624 ms (end to end 64.573 ms, enqueue 1.53205 ms)
[03/08/2022-00:36:04] [I] Average on 10 runs - GPU latency: 59.8073 ms - Host latency: 64.3632 ms (end to end 64.3732 ms, enqueue 1.5212 ms)
[03/08/2022-00:36:04] [I] Average on 10 runs - GPU latency: 59.8378 ms - Host latency: 64.3892 ms (end to end 64.3978 ms, enqueue 1.4762 ms)
[03/08/2022-00:36:04] [I]
[03/08/2022-00:36:04] [I] === Performance summary ===
[03/08/2022-00:36:04] [I] Throughput: 15.5394 qps
[03/08/2022-00:36:04] [I] Latency: min = 63.7902 ms, max = 64.8762 ms, mean = 64.3429 ms, median = 64.3541 ms, percentile(99%) = 64.8762 ms
[03/08/2022-00:36:04] [I] End-to-End Host Latency: min = 63.7978 ms, max = 64.8853 ms, mean = 64.3525 ms, median = 64.3644 ms, percentile(99%) = 64.8853 ms
[03/08/2022-00:36:04] [I] Enqueue Time: min = 1.42554 ms, max = 1.82007 ms, mean = 1.51326 ms, median = 1.5011 ms, percentile(99%) = 1.82007 ms
[03/08/2022-00:36:04] [I] H2D Latency: min = 0.98291 ms, max = 1.15161 ms, mean = 1.00438 ms, median = 1.00171 ms, percentile(99%) = 1.15161 ms
[03/08/2022-00:36:04] [I] GPU Compute Time: min = 59.3004 ms, max = 60.2296 ms, mean = 59.7992 ms, median = 59.8034 ms, percentile(99%) = 60.2296 ms
[03/08/2022-00:36:04] [I] D2H Latency: min = 3.01245 ms, max = 3.61584 ms, mean = 3.53929 ms, median = 3.54633 ms, percentile(99%) = 3.61584 ms
[03/08/2022-00:36:04] [I] Total Host Walltime: 3.08892 s
[03/08/2022-00:36:04] [I] Total GPU Compute Time: 2.87036 s
[03/08/2022-00:36:04] [I] Explanations of the performance metrics are printed in the verbose logs.
[03/08/2022-00:36:04] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=./fake_quantized_detnet.onnx --int8 --useDLACore=0 --allowGPUFallback
[03/08/2022-00:36:04] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1393, GPU 11288 (MiB)
Thanks.
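
To confirm which JetPack/L4T and TensorRT versions are on your device, a quick check like the following can help (a minimal sketch; `/etc/nv_tegra_release` is present on Jetson L4T images, and the package query assumes a Debian-based Jetson rootfs):

```shell
# Print the L4T release string (maps to a JetPack version); file exists on Jetson devices
if [ -f /etc/nv_tegra_release ]; then
    cat /etc/nv_tegra_release
else
    echo "Not a Jetson device"
fi

# List installed TensorRT packages via dpkg (Debian-based Jetson images)
dpkg -l 2>/dev/null | grep -i tensorrt || echo "TensorRT package not found"
```

If the reported TensorRT version is newer than the one bundled with your JetPack release, reflashing with the matching JetPack (4.6 ships TensorRT 8.0.1) is the supported way to align them.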