Engine creation fails when using DLA with GPU fallback

Hello,

I am trying to use trtexec to build an engine from an ONNX file, using DLA with GPU fallback.
The process fails with

[E] Error[2]: [dlaNode.cpp::validateGraphNode::595] Error Code 2: Internal Error (Assertion node->inputs.size() == 1 failed.)
[02/24/2022-11:58:44] [E] Error[2]: [builder.cpp::buildSerializedNetwork::417] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed.)
[02/24/2022-11:58:44] [E] Engine could not be created from network
[02/24/2022-11:58:44] [E] Building engine failed
[02/24/2022-11:58:44] [E] Failed to create engine from model.
[02/24/2022-11:58:44] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8201] # ./trtexec --onnx=../networks/fake_quantized_detnet.onnx --int8 --fp16 --useDLACore=1 --allowGPUFallback

I am using TensorRT v8.2.

Hi,

Would you mind checking whether the conversion works in GPU-only mode first?
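For example, something along these lines (a sketch based on your original command with the DLA options removed; adjust the model path as needed):

$ ./trtexec --onnx=../networks/fake_quantized_detnet.onnx --int8 --fp16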
Thanks.

Already tried. It works fine using GPU only. It’s when using DLA that it fails.

Hi,

Could you share the ONNX file with us to check?
Thanks.

Sure, here it is!

fake_quantized_detnet.onnx (16.7 MB)

Hi,

Which JetPack version are you using?
It seems that you have already updated TensorRT to 8.2.1.

We tested the model with TensorRT v8.0.1, which is included in JetPack 4.6.
It works without issue in DLA mode.

$ /usr/src/tensorrt/bin/trtexec --onnx=./fake_quantized_detnet.onnx --int8 --useDLACore=0 --allowGPUFallback
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=./fake_quantized_detnet.onnx --int8 --useDLACore=0 --allowGPUFallback
[03/08/2022-00:33:53] [I] === Model Options ===
[03/08/2022-00:33:53] [I] Format: ONNX
[03/08/2022-00:33:53] [I] Model: ./fake_quantized_detnet.onnx
[03/08/2022-00:33:53] [I] Output:
[03/08/2022-00:33:53] [I] === Build Options ===
[03/08/2022-00:33:53] [I] Max batch: explicit
[03/08/2022-00:33:53] [I] Workspace: 16 MiB
[03/08/2022-00:33:53] [I] minTiming: 1
[03/08/2022-00:33:53] [I] avgTiming: 8
[03/08/2022-00:33:53] [I] Precision: FP32+INT8
[03/08/2022-00:33:53] [I] Calibration: Dynamic
[03/08/2022-00:33:53] [I] Refit: Disabled
[03/08/2022-00:33:53] [I] Sparsity: Disabled
[03/08/2022-00:33:53] [I] Safe mode: Disabled
[03/08/2022-00:33:53] [I] Restricted mode: Disabled
[03/08/2022-00:33:53] [I] Save engine:
[03/08/2022-00:33:53] [I] Load engine:
[03/08/2022-00:33:53] [I] NVTX verbosity: 0
[03/08/2022-00:33:53] [I] Tactic sources: Using default tactic sources
[03/08/2022-00:33:53] [I] timingCacheMode: local
[03/08/2022-00:33:53] [I] timingCacheFile:
[03/08/2022-00:33:53] [I] Input(s)s format: fp32:CHW
[03/08/2022-00:33:53] [I] Output(s)s format: fp32:CHW
[03/08/2022-00:33:53] [I] Input build shapes: model
[03/08/2022-00:33:53] [I] Input calibration shapes: model
[03/08/2022-00:33:53] [I] === System Options ===
[03/08/2022-00:33:53] [I] Device: 0
[03/08/2022-00:33:53] [I] DLACore: 0(With GPU fallback)
[03/08/2022-00:33:53] [I] Plugins:
[03/08/2022-00:33:53] [I] === Inference Options ===
[03/08/2022-00:33:53] [I] Batch: Explicit
[03/08/2022-00:33:53] [I] Input inference shapes: model
[03/08/2022-00:33:53] [I] Iterations: 10
[03/08/2022-00:33:53] [I] Duration: 3s (+ 200ms warm up)
[03/08/2022-00:33:53] [I] Sleep time: 0ms
[03/08/2022-00:33:53] [I] Streams: 1
[03/08/2022-00:33:53] [I] ExposeDMA: Disabled
[03/08/2022-00:33:53] [I] Data transfers: Enabled
[03/08/2022-00:33:53] [I] Spin-wait: Disabled
[03/08/2022-00:33:53] [I] Multithreading: Disabled
[03/08/2022-00:33:53] [I] CUDA Graph: Disabled
[03/08/2022-00:33:53] [I] Separate profiling: Disabled
[03/08/2022-00:33:53] [I] Time Deserialize: Disabled
[03/08/2022-00:33:53] [I] Time Refit: Disabled
[03/08/2022-00:33:53] [I] Skip inference: Disabled
[03/08/2022-00:33:53] [I] Inputs:
[03/08/2022-00:33:53] [I] === Reporting Options ===
[03/08/2022-00:33:53] [I] Verbose: Disabled
[03/08/2022-00:33:53] [I] Averages: 10 inferences
[03/08/2022-00:33:53] [I] Percentile: 99
[03/08/2022-00:33:53] [I] Dump refittable layers:Disabled
[03/08/2022-00:33:53] [I] Dump output: Disabled
[03/08/2022-00:33:53] [I] Profile: Disabled
[03/08/2022-00:33:53] [I] Export timing to JSON file:
[03/08/2022-00:33:53] [I] Export output to JSON file:
[03/08/2022-00:33:53] [I] Export profile to JSON file:
[03/08/2022-00:33:53] [I]
[03/08/2022-00:33:53] [I] === Device Information ===
[03/08/2022-00:33:53] [I] Selected Device: Xavier
[03/08/2022-00:33:53] [I] Compute Capability: 7.2
[03/08/2022-00:33:53] [I] SMs: 8
[03/08/2022-00:33:53] [I] Compute Clock Rate: 1.377 GHz
[03/08/2022-00:33:53] [I] Device Global Memory: 31920 MiB
[03/08/2022-00:33:53] [I] Shared Memory per SM: 96 KiB
[03/08/2022-00:33:53] [I] Memory Bus Width: 256 bits (ECC disabled)
[03/08/2022-00:33:53] [I] Memory Clock Rate: 1.377 GHz
[03/08/2022-00:33:53] [I]
[03/08/2022-00:33:53] [I] TensorRT version: 8001
[03/08/2022-00:33:54] [I] [TRT] [MemUsageChange] Init CUDA: CPU +354, GPU +0, now: CPU 372, GPU 9758 (MiB)
[03/08/2022-00:33:54] [I] Start parsing network model
[03/08/2022-00:33:54] [I] [TRT] ----------------------------------------------------------------
[03/08/2022-00:33:54] [I] [TRT] Input filename:   ./fake_quantized_detnet.onnx
[03/08/2022-00:33:54] [I] [TRT] ONNX IR version:  0.0.6
[03/08/2022-00:33:54] [I] [TRT] Opset version:    13
[03/08/2022-00:33:54] [I] [TRT] Producer name:    pytorch
[03/08/2022-00:33:54] [I] [TRT] Producer version: 1.9
[03/08/2022-00:33:54] [I] [TRT] Domain:
[03/08/2022-00:33:54] [I] [TRT] Model version:    0
[03/08/2022-00:33:54] [I] [TRT] Doc string:
[03/08/2022-00:33:54] [I] [TRT] ----------------------------------------------------------------
[03/08/2022-00:33:54] [W] [TRT] onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[03/08/2022-00:33:54] [I] Finish parsing network model
[03/08/2022-00:33:54] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 390, GPU 9810 (MiB)
[03/08/2022-00:33:54] [I] FP32 and INT8 precisions have been specified - more performance might be enabled by additionally specifying --fp16 or --best
...
[03/08/2022-00:36:00] [I] [TRT] Total Host Persistent Memory: 52096
[03/08/2022-00:36:00] [I] [TRT] Total Device Persistent Memory: 17249280
[03/08/2022-00:36:00] [I] [TRT] Total Scratch Memory: 0
[03/08/2022-00:36:00] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 20 MiB, GPU 613 MiB
[03/08/2022-00:36:00] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1399, GPU 11291 (MiB)
[03/08/2022-00:36:00] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1399, GPU 11291 (MiB)
[03/08/2022-00:36:00] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1399, GPU 11291 (MiB)
[03/08/2022-00:36:00] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1398, GPU 11291 (MiB)
[03/08/2022-00:36:00] [I] [TRT] [MemUsageSnapshot] Builder end: CPU 1383 MiB, GPU 11291 MiB
[03/08/2022-00:36:00] [I] [TRT] Loaded engine size: 17 MB
[03/08/2022-00:36:00] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 1411 MiB, GPU 11304 MiB
[03/08/2022-00:36:00] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1411, GPU 11304 (MiB)
[03/08/2022-00:36:00] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1411, GPU 11304 (MiB)
[03/08/2022-00:36:00] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1411, GPU 11304 (MiB)
[03/08/2022-00:36:00] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 1411 MiB, GPU 11304 MiB
[03/08/2022-00:36:00] [I] Engine built in 127.321 sec.
[03/08/2022-00:36:00] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 1393 MiB, GPU 11287 MiB
[03/08/2022-00:36:00] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1393, GPU 11287 (MiB)
[03/08/2022-00:36:00] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1393, GPU 11287 (MiB)
[03/08/2022-00:36:00] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 1393 MiB, GPU 11288 MiB
[03/08/2022-00:36:01] [I] Created input binding for xyz with dimensions 1x9x960x400
[03/08/2022-00:36:01] [I] Created input binding for radar with dimensions 1x5x960x400
[03/08/2022-00:36:01] [I] Created input binding for rgb with dimensions 1x10x960x400
[03/08/2022-00:36:01] [I] Created output binding for conv5 with dimensions 1x96000x7x2x21
[03/08/2022-00:36:01] [I] Starting inference
[03/08/2022-00:36:04] [I] Warmup completed 4 queries over 200 ms
[03/08/2022-00:36:04] [I] Timing trace has 48 queries over 3.08892 s
[03/08/2022-00:36:04] [I]
[03/08/2022-00:36:04] [I] === Trace details ===
[03/08/2022-00:36:04] [I] Trace averages of 10 runs:
[03/08/2022-00:36:04] [I] Average on 10 runs - GPU latency: 59.6344 ms - Host latency: 64.1485 ms (end to end 64.1578 ms, enqueue 1.49861 ms)
[03/08/2022-00:36:04] [I] Average on 10 runs - GPU latency: 59.9732 ms - Host latency: 64.5624 ms (end to end 64.573 ms, enqueue 1.53205 ms)
[03/08/2022-00:36:04] [I] Average on 10 runs - GPU latency: 59.8073 ms - Host latency: 64.3632 ms (end to end 64.3732 ms, enqueue 1.5212 ms)
[03/08/2022-00:36:04] [I] Average on 10 runs - GPU latency: 59.8378 ms - Host latency: 64.3892 ms (end to end 64.3978 ms, enqueue 1.4762 ms)
[03/08/2022-00:36:04] [I]
[03/08/2022-00:36:04] [I] === Performance summary ===
[03/08/2022-00:36:04] [I] Throughput: 15.5394 qps
[03/08/2022-00:36:04] [I] Latency: min = 63.7902 ms, max = 64.8762 ms, mean = 64.3429 ms, median = 64.3541 ms, percentile(99%) = 64.8762 ms
[03/08/2022-00:36:04] [I] End-to-End Host Latency: min = 63.7978 ms, max = 64.8853 ms, mean = 64.3525 ms, median = 64.3644 ms, percentile(99%) = 64.8853 ms
[03/08/2022-00:36:04] [I] Enqueue Time: min = 1.42554 ms, max = 1.82007 ms, mean = 1.51326 ms, median = 1.5011 ms, percentile(99%) = 1.82007 ms
[03/08/2022-00:36:04] [I] H2D Latency: min = 0.98291 ms, max = 1.15161 ms, mean = 1.00438 ms, median = 1.00171 ms, percentile(99%) = 1.15161 ms
[03/08/2022-00:36:04] [I] GPU Compute Time: min = 59.3004 ms, max = 60.2296 ms, mean = 59.7992 ms, median = 59.8034 ms, percentile(99%) = 60.2296 ms
[03/08/2022-00:36:04] [I] D2H Latency: min = 3.01245 ms, max = 3.61584 ms, mean = 3.53929 ms, median = 3.54633 ms, percentile(99%) = 3.61584 ms
[03/08/2022-00:36:04] [I] Total Host Walltime: 3.08892 s
[03/08/2022-00:36:04] [I] Total GPU Compute Time: 2.87036 s
[03/08/2022-00:36:04] [I] Explanations of the performance metrics are printed in the verbose logs.
[03/08/2022-00:36:04] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=./fake_quantized_detnet.onnx --int8 --useDLACore=0 --allowGPUFallback
[03/08/2022-00:36:04] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1393, GPU 11288 (MiB)

Thanks.

I am using JetPack 4.6 with TensorRT 8.2 built from source.
Did you try it with TensorRT 8.2? We need to use this particular version.

Hi,

Please note that we only release the source for the plugin library.
So you would end up with the v8.0 core library alongside the v8.2 plugin library.
This can cause issues since the APIs are not consistent.
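If you are unsure which core library is actually installed, one way to check (assuming the Debian packages shipped with JetPack) is to list the TensorRT packages:

$ dpkg -l | grep nvinfer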

We have just released the new JetPack 4.6.1, which includes TensorRT v8.2.
We recommend using that release instead.

That said, it seems that your issue is caused by an incorrect input argument.
In the original command, you pass both --int8 and --fp16 to the trtexec binary.
Have you tried removing the --fp16 flag to see if it works?
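For example, something like this (a sketch of your original command with only the --fp16 flag removed, untested on our side):

$ ./trtexec --onnx=../networks/fake_quantized_detnet.onnx --int8 --useDLACore=1 --allowGPUFallback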

Thanks.

I tried removing the --fp16 flag and it works.
But why is using both flags incorrect? trtexec has a --best flag which does exactly this. Not all layers are quantized, so the fp32 layers need to be converted to fp16 for better performance.
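For reference, this is the kind of invocation I would expect to build (a sketch based on my original command, with --best replacing the explicit precision flags):

./trtexec --onnx=../networks/fake_quantized_detnet.onnx --best --useDLACore=1 --allowGPUFallback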
I will try to use JetPack 4.6.1.

Hi,

It looks like the constraint comes from your model.
Some layers are quantized to INT8, so you cannot deploy all of the layers in fp16 mode.
However, using the best mode (fp16+int8) is possible.

When you pass multiple precision flags, trtexec uses the last one according to its parsing rules.
Thanks.

I understand the constraints, but in my opinion it should still fall back to the GPU rather than refuse to build the engine.
You mentioned the best mode (fp16+int8) should work, but it doesn't on DLA.
So far the only way to build the engine for DLA is with --int8 only. But this would not give maximum performance, since layers that are not quantized would run in FP32.
I’m just trying to understand the reason why it fails instead of falling back to GPU.
