Hello @AastaLLL,
I am sorry for the late reply.
Thank you for clarifying that TRT is optimized based on performance; that makes sense.
Here is some feedback on my experiments:
Experiment 1
I finally tested and compared the two quantization experiments based on TRT 8.5.2 (with and without data type constraints):
A) With constraints:
- As you can see in the provided script run_quantization.zip (5.8 KB), here is the list of layers whose data type I had to force to HALF in order to avoid Error Code 10: Internal Error (Could not find any implementation for node ...) (a minimal sketch of how such precision constraints can be applied through the TensorRT Python API is given at the end of this subsection):
- “/Resize_1”
- “/backbone/semantic/stage1/pool/MaxPool”
- “/backbone/semantic/stage2/0/dwconv/0/conv/Conv”
- “/backbone/bga/detail_down/1/AveragePool”
- “/backbone/semantic/stage2/1/dwconv/0/conv/Conv”
- “/backbone/semantic/stage2/1/dwconv/0/activate/Relu”
- “/backbone/semantic/stage3/0/dwconv/0/conv/Conv”
- “/backbone/semantic/stage3/1/dwconv/0/conv/Conv”
- “/backbone/semantic/stage3/1/dwconv/0/activate/Relu”
- “/backbone/semantic/stage4/0/dwconv/0/conv/Conv”
- “/backbone/semantic/stage4/1/dwconv/0/conv/Conv”
- “/backbone/semantic/stage4/1/dwconv/0/activate/Relu”
- “/backbone/semantic/stage4/2/dwconv/0/conv/Conv”
- “/backbone/semantic/stage4/2/dwconv/0/activate/Relu”
- “/backbone/semantic/stage4/3/dwconv/0/conv/Conv”
- “/backbone/semantic/stage4/3/dwconv/0/activate/Relu”
- “/backbone/semantic/stage4_CEBlock/gap/0/GlobalAveragePool”
- Precision statistics: ~ 3/4 INT8, ~ 1/4 FP16, 1 layer in FP32.
- Complete trex report here: sima_simplified_with_constraints.engine.zip (3.3 MB)
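For clarity, here is a minimal, hedged sketch of how I apply such per-layer precision constraints with the TensorRT 8.5 Python API. It is only an illustration of the mechanism, not a copy of my run_quantization.py; the helper name and the shortened layer list are placeholders, and the Concat name check is based on the layer name reported in my build logs:

```python
import tensorrt as trt

# Shortened list of layers forced to FP16 (see the attached script for the full list).
HALF_LAYERS = {
    "/Resize_1",
    "/backbone/semantic/stage1/pool/MaxPool",
    "/backbone/bga/detail_down/1/AveragePool",
}

def apply_precision_constraints(network: trt.INetworkDefinition,
                                config: trt.IBuilderConfig) -> None:
    """Force FP16 on selected layers and INT32 on the problematic Concat output."""
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if layer.name in HALF_LAYERS:
            layer.precision = trt.float16
            for j in range(layer.num_outputs):
                layer.set_output_type(j, trt.float16)
        # Same mechanism used for the Concat output that must stay INT32,
        # otherwise the build fails with
        # "cannot use precision Int8 with weights of type Int32".
        if "/backbone/bga/Concat" in layer.name:
            layer.set_output_type(0, trt.int32)
    # In TRT 8.5 I enforce the constraints with STRICT_TYPES
    # (OBEY_PRECISION_CONSTRAINTS is the non-deprecated equivalent).
    config.set_flag(trt.BuilderFlag.STRICT_TYPES)
```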
B) Without constraints:
In both cases, I used the command: python run_quantization.py --int8 --verbose=DEBUG --calibration-data=/path/to/cityscapes/calibration/dataset/ --calibration-cache=cityscapes_calib.cache --explicit-batch -m /model/bisenetv2_simplified.onnx
Comparison of both engine models, performed with the trex notebook “compare_engines.ipynb”:
Conclusion:
- The data type of layer “/backbone/bga/Concat_output_0” has to be set to INT32 in any case; otherwise the script can’t continue building the engine.
- Letting TensorRT choose the data types of the layers does not yield a fully INT8-quantized engine. Forcing some layers to HALF obviously does not lead to complete INT8 quantization either, but with constraints the number of INT8 layers tends to increase slightly.
- Even though the engine built with constraints has more INT8 layers, its overall latency is higher than that of the engine built without constraints. Constraining the data types also introduces more “Reformat” layers, which contributes to the higher overall latency.
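As a side note, here is a hedged sketch of how per-layer precisions can also be checked programmatically with TensorRT’s engine inspector, independently of trex. The exact JSON schema varies between TensorRT versions, so treat this only as an illustration:

```python
import json
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def layer_precision_histogram(engine_path: str) -> dict:
    """Roughly count the output datatypes of each layer in a built engine.

    Requires the engine to have been built with
    config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED,
    otherwise per-layer information is incomplete.
    """
    runtime = trt.Runtime(TRT_LOGGER)
    with open(engine_path, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())

    inspector = engine.create_engine_inspector()
    info = json.loads(inspector.get_engine_information(trt.LayerInformationFormat.JSON))

    histogram = {}
    for layer in info.get("Layers", []):
        # The datatype key name may differ between TRT versions; fall back gracefully.
        for out in layer.get("Outputs", []):
            dtype = out.get("Format/Datatype", "unknown")
            histogram[dtype] = histogram.get(dtype, 0) + 1
    return histogram

# Example usage:
# print(layer_precision_histogram("sima_simplified_with_constraints.engine"))
```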
Experiment 2
I also investigated TRT 10.3.0. In this experiment, I reused the exact same script written for TRT 8.5.2 and only slightly adapted some parts to make it compatible with TRT 10.3.0. Here are the two adapted snippets, which you can also find in the provided script run_quantization.zip (6.8 KB); a consolidated sketch of how they fit into the builder configuration follows after the snippets:
- Snippet 1:
#"strict_types": trt.BuilderFlag.STRICT_TYPES,
# --- Adaptation for TensorRT 10.3.0
"strict_types": trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS,
- Snippet 2:
#config.max_workspace_size = 2**30 # 1GiB
# --- Adaptation for TensorRT 10.3.0
config.set_memory_pool_limit(
trt.MemoryPoolType.WORKSPACE, 2**30
)
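For context, here is a minimal, hedged sketch of how these two adaptations plug into a TensorRT 10.x builder configuration. This is only an outline under my assumptions, not my actual run_quantization.py; the helper name is hypothetical and the calibrator is assumed to be the same one as in my previous posts:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)

def build_int8_engine_trt10(onnx_path: str, calibrator) -> trt.IHostMemory:
    """Sketch of an INT8 build with the TRT 10.x API (hypothetical helper)."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(0)  # networks are always explicit-batch in TRT 10
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()
    # Replacement for the removed config.max_workspace_size attribute.
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 2**30)  # 1 GiB
    # Replacement for the removed STRICT_TYPES flag.
    config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)
    config.set_flag(trt.BuilderFlag.INT8)
    config.set_flag(trt.BuilderFlag.FP16)
    config.int8_calibrator = calibrator  # implicit quantization, deprecated in TRT 10

    return builder.build_serialized_network(network, config)
```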
Here again, I used the command: python run_quantization.py --int8 --verbose=DEBUG --calibration-data=/path/to/cityscapes/calibration/dataset/ --calibration-cache=cityscapes_calib.cache --explicit-batch -m /model/bisenetv2_simplified.onnx
Unfortunately, with TRT 10.3.0, I couldn’t let the program choose the best data types. In fact, I had to successively force the type of some layers to avoid Error Code 10: Internal Error (Could not find any implementation for node ...). I could bypass the issue by setting the data type of layers “/backbone/detail/detail_branch.2/0/conv/Conv”, “/backbone/semantic/stage1/convs/0/conv/Conv” and “/backbone/semantic/stage1/convs/1/conv/Conv” to BF16, as you can see in the provided script. Sadly, when reaching layer “/backbone/semantic/stage1/pool/MaxPool”, I individually tested each data type listed in the documentation here, but always got the following error when trying to build the quantized engine:
[TRT] [E] IBuilder::buildSerializedNetwork: Error Code 10: Internal Error (Could not find any implementation for node /backbone/semantic/stage1/pool/MaxPool.)
Conclusion: Whereas there is no issue with TRT 8.5.2, TRT 10.3.0 apparently can’t find any implementation for the MaxPool layer.
Experiment 3
As you recommended, since implicit quantization is deprecated, I finally also tried explicit quantization by keeping the same script and only adding the --explicit-precision flag to the command (since I think it is the only way to do explicit quantization without changing my current script). The command I used here was therefore: python run_quantization.py --int8 --verbose=DEBUG --calibration-data=/path/to/cityscapes/calibration/dataset/ --calibration-cache=cityscapes_calib.cache --explicit-batch --explicit-precision -m /model/bisenetv2_simplified.onnx
I tested this only with TRT 8.5.2 and without type constraints. However, I noticed I still had to set layer “/backbone/bga/Concat_output_0” to INT32. This always seems to be necessary in order to avoid the following issue:
[TRT] [E] 3: /backbone/bga/Concat_output_0: cannot use precision Int8 with weights of type Int32
[TRT] [E] 4: [network.cpp::validate::3015] Error Code 4: Internal Error (Layer /backbone/bga/Concat_output_0 failed validation)
This way, I could generate an engine, but observed that there is no difference between the engines built with and without the --explicit-precision flag:
Conclusion: With TRT 10.3.0, using the --explicit-precision flag doesn’t work since EXPLICIT_PRECISION is deprecated. In fact, TensorRT now automatically infers precision based on the provided calibration scales and user-defined precision constraints. Alternatively, I tried to do explicit quantization with TRT 10.3.0 by specifying scales for all tensors in the network. This unfortunately didn’t help to generate an engine model.
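For completeness, this is roughly how I understand per-tensor scale specification in the Python API. It is a hedged sketch only; the dynamic_range property is deprecated in TRT 10 in favor of explicit Q/DQ nodes in the ONNX graph, and in a real setup the per-tensor ranges would come from calibration rather than a single constant:

```python
import tensorrt as trt

def set_all_dynamic_ranges(network: trt.INetworkDefinition, amax: float = 127.0) -> None:
    """Assign a symmetric dynamic range to every tensor in the network (illustration only)."""
    # Network inputs.
    for i in range(network.num_inputs):
        network.get_input(i).dynamic_range = (-amax, amax)
    # Outputs of every layer.
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        for j in range(layer.num_outputs):
            layer.get_output(j).dynamic_range = (-amax, amax)
```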
Overall conclusion
- It seems that intentionally constraining some layer data types is necessary to build the engine (at least 1 layer for TRT 8.5.2, and apparently at least 4 for TRT 10.3.0, where the build still got stuck at layer “/backbone/semantic/stage1/pool/MaxPool”).
- Engines built with constraints present slightly more INT8 layers but greater latency (due to the introduction of “Reformat” layers).
- Explicit quantization (at least via the --explicit-precision flag) didn’t bring any improvements in terms of quantization.
To further my experiments and test the limits of complete INT8 quantization with TensorRT, I will now explore two options:
- Execute the same script with a much simpler model architecture.
- Change strategy and leverage TensorRT’s PyTorch Quantization Toolkit (pytorch-quantization, whose documentation is also available here) to perform explicit quantization; a rough sketch of that workflow is given below.
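For the second option, this is roughly the workflow I have in mind with pytorch-quantization (a hedged sketch based on my reading of the toolkit’s documentation; the model constructor, calibration loader and input shape are placeholders):

```python
import torch
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

# Monkey-patch torch.nn modules with their fake-quantized counterparts
# before the model is instantiated.
quant_modules.initialize()
model = build_bisenetv2()  # placeholder for the actual model constructor
model.eval()

# Calibration: collect activation statistics instead of quantizing.
with torch.no_grad():
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            module.disable_quant()
            module.enable_calib()
    for images in calibration_loader:  # placeholder Cityscapes calibration loader
        model(images)
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            module.load_calib_amax()
            module.enable_quant()
            module.disable_calib()

# Export an ONNX model with explicit Q/DQ nodes for TensorRT.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
torch.onnx.export(model, torch.randn(1, 3, 1024, 2048), "bisenetv2_qdq.onnx", opset_version=13)
```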
I hope this feedback will also be useful to you. Thank you again for your help and patience!
Note: In this reply, I only attached the quantization scripts themselves. The other helper and calibrator scripts remain unchanged compared to what I sent in my previous posts above.