Model quantized in explicit precision mode (with Q/DQ nodes) fails during engine generation

Description

Here is the ONNX model I used to generate the engine: model

It was quantized with the pytorch_quantization toolkit, following the simplest instructions given (using quant_modules.initialize() to automatically replace all supported layers, with no manual Q/DQ placement), and calibrated following the same code provided here.
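For reference, the workflow looked roughly like this minimal sketch, adapted from the pytorch_quantization examples; MyModel and calib_loader are placeholders for my actual model and calibration dataloader:

# Sketch of the quantization + calibration flow described above.
import torch
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

quant_modules.initialize()            # monkey-patch supported layers with quantized versions
model = MyModel().cuda().eval()       # placeholder; model must be built AFTER initialize()

# Phase 1: collect statistics -- disable quantization, enable calibrators.
for _, module in model.named_modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.disable_quant()
        module.enable_calib()

with torch.no_grad():
    for batch in calib_loader:        # placeholder calibration dataloader
        model(batch.cuda())

# Phase 2: load amax from the collected statistics, re-enable quantization.
for _, module in model.named_modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.load_calib_amax()
        module.enable_quant()
        module.disable_calib()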

After quantization, I exported the ONNX file with opset 13 using a PyTorch nightly build (only nightly supports the required ops). The export succeeded, and I then ran onnx-simplifier on the result, which also succeeded.
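The export step, continuing from the sketch above, looked roughly like this; the dummy input shape is a placeholder for my model's real input:

# Sketch of the ONNX export step.
from pytorch_quantization import nn as quant_nn

quant_nn.TensorQuantizer.use_fb_fake_quant = True   # export fake-quant as ONNX Q/DQ nodes
dummy_input = torch.randn(1, 3, 440, 1024).cuda()   # placeholder shape
torch.onnx.export(model, dummy_input, "model_qdq.onnx",
                  opset_version=13, do_constant_folding=True)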

According to the documentation, an explicitly quantized model can be compiled directly to an engine without further configuration. However, both trtexec with the --int8 flag and a custom engine-building script (following the most basic example provided by the documentation here) failed with this error message:

[05/05/2022-01:44:01] [TRT] [V] Running: ConstWeightsQuantizeFusion
[05/05/2022-01:44:01] [TRT] [V] ConstWeightsQuantizeFusion: Fusing update_block.flow_head.conv1.weight with QuantizeLinear_1641_quantize_scale_node
[05/05/2022-01:44:01] [TRT] [V] Running: ConstWeightsQuantizeFusion
[05/05/2022-01:44:01] [TRT] [V] ConstWeightsQuantizeFusion: Fusing update_block.flow_head.conv2.weight with QuantizeLinear_1648_quantize_scale_node
[05/05/2022-01:44:01] [TRT] [V] Running: VanillaSwapWithFollowingQ
[05/05/2022-01:44:01] [TRT] [V] Swapping Relu_631 with QuantizeLinear_652_quantize_scale_node
[05/05/2022-01:44:01] [TRT] [V] Running: SplitQAcrossPrecedingFanIn
[05/05/2022-01:44:01] [TRT] [V] Running: SplitQAcrossPrecedingFanIn
[05/05/2022-01:44:02] [TRT] [V] Running: SplitQAcrossPrecedingFanIn
[05/05/2022-01:44:02] [TRT] [V] Running: SplitQAcrossPrecedingFanIn
[05/05/2022-01:44:02] [TRT] [V] Running: SplitQAcrossPrecedingFanIn
[05/05/2022-01:44:02] [TRT] [E] 2: [checkSanity.cpp::checkSanity::106] Error Code 2: Internal Error (Assertion regionNames.find(r->name) == regionNames.end() failed. Found duplicate region name onnx::Concat_1406_clone_1)
[05/05/2022-01:44:02] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::619] Error Code 2: Internal Error (Assertion engine != nullptr failed. )

It sounds like there is a duplicate node in the graph, but I could not find one when searching through the ONNX model. Please help me with this, thanks!
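For what it's worth, this is roughly how one can scan the graph for duplicates (a sketch using the onnx package; the path is a placeholder). It turns up nothing here, which suggests the _clone_1 name in the error is generated inside TensorRT's optimizer rather than present in the ONNX file:

# Sketch: scan the ONNX graph for duplicate node/tensor names.
import collections
import onnx

model = onnx.load("model_qdq.onnx")   # placeholder path
node_names = [n.name for n in model.graph.node if n.name]
tensor_names = [out for n in model.graph.node for out in n.output]

for label, names in (("node", node_names), ("output tensor", tensor_names)):
    dupes = [name for name, count in collections.Counter(names).items() if count > 1]
    print(f"duplicate {label} names:", dupes or "none")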

PS: The non-quantized version of the model can be successfully compiled to an engine without any issue.

Environment

TensorRT Version: 8.4 GA
GPU Type: NVIDIA RTX 3090
Nvidia Driver Version: 11.6
CUDA Version: 11.5
CUDNN Version: 11.5
Operating System + Version: Ubuntu 20.04 LTS
Python Version (if applicable): 3.8.12
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.12 (nightly build as of May 5, 2022)
Baremetal or Container (if container which image + tag):

Relevant Files

ONNX file

Steps To Reproduce

Running trtexec with the following arguments will reproduce the error:

trtexec --onnx=onnx_dir --saveEngine='engine_dir' --workspace=4096 --int8 --fp16 --noTF32 --verbose --noDataTransfers --separateProfileRun --dumpProfile --useCudaGraph > 'log_dir'

Hi, please refer to the links below to perform inference in INT8.

Thanks!

To add to this problem, I also tried quantization through TensorRT with a custom calibrator. Although it does not fail on the same node, it gives the same assertion failure:

Completed parsing of ONNX file
Building an engine…
[05/05/2022-14:30:33] [TRT] [I] MatMul_707: broadcasting input1 to make tensors conform, dims(input0)=[4,880,256][NONE] dims(input1)=[1,256,256][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_709: broadcasting input1 to make tensors conform, dims(input0)=[4,880,256][NONE] dims(input1)=[1,256,256][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_711: broadcasting input1 to make tensors conform, dims(input0)=[4,880,256][NONE] dims(input1)=[1,256,256][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_760: broadcasting input1 to make tensors conform, dims(input0)=[4,880,256][NONE] dims(input1)=[1,256,256][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_762: broadcasting input1 to make tensors conform, dims(input0)=[4,880,256][NONE] dims(input1)=[1,256,256][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_764: broadcasting input1 to make tensors conform, dims(input0)=[4,880,256][NONE] dims(input1)=[1,256,256][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_732: broadcasting input1 to make tensors conform, dims(input0)=[4,880,256][NONE] dims(input1)=[1,256,256][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_785: broadcasting input1 to make tensors conform, dims(input0)=[4,880,256][NONE] dims(input1)=[1,256,256][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_745: broadcasting input1 to make tensors conform, dims(input0)=[4,880,512][NONE] dims(input1)=[1,512,512][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_798: broadcasting input1 to make tensors conform, dims(input0)=[4,880,512][NONE] dims(input1)=[1,512,512][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_747: broadcasting input1 to make tensors conform, dims(input0)=[4,880,512][NONE] dims(input1)=[1,512,256][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_800: broadcasting input1 to make tensors conform, dims(input0)=[4,880,512][NONE] dims(input1)=[1,512,256][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_849: broadcasting input1 to make tensors conform, dims(input0)=[4,880,256][NONE] dims(input1)=[1,256,256][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_851: broadcasting input1 to make tensors conform, dims(input0)=[4,880,256][NONE] dims(input1)=[1,256,256][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_853: broadcasting input1 to make tensors conform, dims(input0)=[4,880,256][NONE] dims(input1)=[1,256,256][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_904: broadcasting input1 to make tensors conform, dims(input0)=[4,880,256][NONE] dims(input1)=[1,256,256][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_876: broadcasting input1 to make tensors conform, dims(input0)=[4,880,256][NONE] dims(input1)=[1,256,256][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_889: broadcasting input1 to make tensors conform, dims(input0)=[4,880,512][NONE] dims(input1)=[1,512,512][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_891: broadcasting input1 to make tensors conform, dims(input0)=[4,880,512][NONE] dims(input1)=[1,512,256][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_906: broadcasting input1 to make tensors conform, dims(input0)=[4,880,256][NONE] dims(input1)=[1,256,256][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_908: broadcasting input1 to make tensors conform, dims(input0)=[4,880,256][NONE] dims(input1)=[1,256,256][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_931: broadcasting input1 to make tensors conform, dims(input0)=[4,880,256][NONE] dims(input1)=[1,256,256][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_944: broadcasting input1 to make tensors conform, dims(input0)=[4,880,512][NONE] dims(input1)=[1,512,512][NONE].
[05/05/2022-14:30:33] [TRT] [I] MatMul_946: broadcasting input1 to make tensors conform, dims(input0)=[4,880,512][NONE] dims(input1)=[1,512,256][NONE].
[05/05/2022-14:30:34] [TRT] [V] Original: 2842 layers
[05/05/2022-14:30:34] [TRT] [V] After dead-layer removal: 2842 layers
[05/05/2022-14:30:34] [TRT] [E] 2: [checkSanity.cpp::checkSanity::106] Error Code 2: Internal Error (Assertion regionNames.find(r->name) == regionNames.end() failed. Found duplicate region name (Unnamed Layer* 435) [Constant]_output)
[05/05/2022-14:30:34] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::619] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
Completed creating Engine
Traceback (most recent call last):
  File "build_eg.py", line 7, in <module>
    f.write(serialized_engine)
TypeError: a bytes-like object is required, not 'NoneType'

Here is the non-quantized ONNX model

Here is the calibrator I used; it is just a mild modification of the official sample: script
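For context, it follows the usual IInt8EntropyCalibrator2 pattern from the sample, roughly like this sketch (the batch source, buffer size, and cache path are placeholders):

# Sketch of the calibrator, modeled on the official sample.
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, batch_size, batch_nbytes, cache_file="calib.cache"):
        super().__init__()
        self.batches = iter(batches)            # iterable of contiguous float32 arrays
        self.batch_size = batch_size
        self.cache_file = cache_file
        self.device_input = cuda.mem_alloc(batch_nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            data = next(self.batches)
        except StopIteration:
            return None                         # no more data: calibration is done
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(data))
        return [int(self.device_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)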

Model-building script: script
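The build script is essentially the minimal builder pattern from the docs, along these lines (paths and workspace size are placeholders). Note the None check, which build_eg.py was evidently missing: build_serialized_network returns None on failure, which is what produced the TypeError in the traceback above:

# Sketch of a minimal INT8 build script with a custom calibrator.
import tensorrt as trt

logger = trt.Logger(trt.Logger.VERBOSE)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:             # placeholder: non-quantized model
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parsing failed")

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 << 30)
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = EntropyCalibrator(batches, batch_size, batch_nbytes)

serialized_engine = builder.build_serialized_network(network, config)
if serialized_engine is None:
    raise SystemExit("engine build failed -- see TensorRT errors above")

with open("model.engine", "wb") as f:
    f.write(serialized_engine)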

Thank you for your prompt response. I had already read those materials multiple times before trying this myself. There aren't many thorough materials or examples covering the whole pipeline of TensorRT-based quantization, or of pytorch_quantization → TensorRT engine compilation, in the first place.

It would be great if you could take a look at the raised issue and share some hints on how to deal with it. Thanks!

Hi,

We could reproduce the same error. This looks like a known issue, which will be fixed in a future release.

Thank you.

Any updates regarding this?

This is a concerning issue: models quantized in explicit precision mode (with Q/DQ nodes) are a key component of many machine learning applications, since low-precision models can offer significant performance improvements over higher-precision ones at inference time. Unfortunately, engine generation fails with this approach. The failure could stem from several factors, such as incorrect quantization parameters, inadequate hardware support, or the need for additional post-processing, so it is important to investigate the root cause before attempting to deploy the model. It may also be necessary to switch to a different quantization approach, such as dynamic or integer quantization, to get engine generation to succeed. Ultimately, this issue highlights the importance of thoroughly testing and validating machine learning models before deployment.