Dear community,
In order to optimize a semantic segmentation model running on a Jetson Orin NX, I am looking into Post-Training Quantization (PTQ). The model is a BiSeNet V2 trained on Cityscapes, whose layers I am trying to convert from FP32 to INT8 using TensorRT's calibration-based workflow (class trt.IInt8EntropyCalibrator2). I am aiming for full INT8 quantization, but some layers apparently cannot be converted.
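For context, my calibrator follows the usual pattern of subclassing trt.IInt8EntropyCalibrator2. Below is a simplified sketch of it (class name, batch size, input shape and preprocessing are placeholders, not the exact values from my script):

```python
import os
import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context for pycuda
import pycuda.driver as cuda
import tensorrt as trt
from PIL import Image


class CityscapesEntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds preprocessed Cityscapes images to TensorRT during INT8 calibration."""

    def __init__(self, image_dir, cache_file, batch_size=1, input_shape=(3, 512, 1024)):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.cache_file = cache_file
        self.batch_size = batch_size
        self.input_shape = input_shape  # placeholder CHW shape, adapt to the model input
        self.files = sorted(
            os.path.join(image_dir, f)
            for f in os.listdir(image_dir)
            if f.lower().endswith((".png", ".jpg"))
        )
        self.index = 0
        # One device buffer holding a full FP32 calibration batch.
        self.device_input = cuda.mem_alloc(
            int(np.prod((batch_size,) + input_shape)) * np.dtype(np.float32).itemsize
        )

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.index + self.batch_size > len(self.files):
            return None  # no more data: calibration stops here
        batch = np.stack(
            [self._preprocess(p) for p in self.files[self.index:self.index + self.batch_size]]
        ).astype(np.float32)
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        self.index += self.batch_size
        return [int(self.device_input)]

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

    def _preprocess(self, path):
        # Placeholder preprocessing: replace with the exact pipeline used at training time.
        c, h, w = self.input_shape
        img = Image.open(path).convert("RGB").resize((w, h))
        return np.asarray(img, dtype=np.float32).transpose(2, 0, 1) / 255.0
```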
So far, I have only obtained partial quantization, with some layers left in FP16 (since they apparently cannot be converted to INT8) and the others successfully converted to INT8, as you can see in the pictures below generated with "TREx". Note that I ran the ONNX model through the simplification library "onnx-simplifier" to increase the number of convertible layers.
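The simplification step itself is just the standard onnx-simplifier call, roughly as follows (file names are placeholders):

```python
import onnx
from onnxsim import simplify

model = onnx.load("bisenetv2.onnx")      # original PyTorch export
model_simplified, ok = simplify(model)   # fold constants, remove redundant shape ops, etc.
assert ok, "Simplified ONNX model failed the validation check"
onnx.save(model_simplified, "simplified.onnx")
```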
As you can see in the second of the two images below, the BiSeNet input is FP32, the output is FP16, and 25 FP16 layers are scattered around the middle of the network:
- Quantized BiSeNet V2:
- Standard BiSeNet V2:
For reference, here are the ONNX model, the command and the function I use for quantization:
- BiSeNet V2 ONNX:
simplified.zip (12.0 MB)
- Command: python production/src/production/bin/conversion/run_quantization.py --int8 --calibration-data=/media/10TBHardDisk25/Datasets/cityscapes_train_for_calib/train/ --calibration-cache=/mnt/erx/caches/cityscapes_train_calib.cache --explicit-batch -m /media/10TBHardDisk25/BiSeNetV2Tests/simplified.onnx
- Function: build_engine_bisenet() of the Python script below:
build_engine_bisenet.zip (1.7 KB)
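For readers who prefer not to download the archive, the core of build_engine_bisenet() boils down to the outline below (argument names are mine for this sketch and it reuses the calibrator shown above; the attached script is the authoritative version):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)


def build_engine_bisenet(onnx_path, calibrator, workspace_gb=2):
    """Parses the simplified ONNX model and builds an INT8 engine with FP16 fallback."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("Failed to parse the ONNX model")

    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace_gb << 30)
    config.set_flag(trt.BuilderFlag.FP16)   # allow FP16 fallback for non-INT8 layers
    config.set_flag(trt.BuilderFlag.INT8)
    config.int8_calibrator = calibrator

    serialized_engine = builder.build_serialized_network(network, config)
    if serialized_engine is None:
        raise RuntimeError("Engine build failed")
    return serialized_engine
```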
This incomplete quantization has an impact on the model's inference time (5.6 ms for the quantized model vs. 4.6 ms for the standard model). Indeed, not only does the quantized model contain 109 layers versus the standard model's 86:
But, as you can also see in the figures below generated with "TREx", this mixed-precision approach also introduces "Reformat" layers, resulting in a higher overall latency than the standard model:
These Reformat layers are typically inserted to handle tensor precision and format conversions between mixed-precision layers, and they introduce additional latency (extra memory traffic and computation). This increases the overall latency of the model and counterbalances the performance gain expected from quantization.
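For reference, the number of Reformat layers can also be checked without TREx using TensorRT's engine inspector. Here is a minimal sketch (function name is just for illustration; full per-layer details assume the engine was built with config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED):

```python
import json
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)


def count_reformat_layers(engine_path):
    """Counts 'Reformat' layers reported by the engine inspector of a serialized engine."""
    runtime = trt.Runtime(TRT_LOGGER)
    with open(engine_path, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())

    inspector = engine.create_engine_inspector()
    info = json.loads(
        inspector.get_engine_information(trt.LayerInformationFormat.JSON)
    )
    layers = info.get("Layers", [])
    reformats = [layer for layer in layers if "Reformat" in str(layer)]
    print(f"{len(reformats)} reformat layer(s) out of {len(layers)} total layers")
    return reformats
```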
I believe this incomplete INT8 conversion may be due to the average pooling layers, which TensorRT fails to quantize. Indeed, semantic segmentation models usually contain many average poolings, and when quantizing BiSeNet V2 I had to force their precision to trt.DataType.HALF.
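Concretely, forcing the pooling layers to FP16 looks roughly like the helper below, applied to the parsed network before building the engine (a sketch, not the exact code from my script; the OBEY_PRECISION_CONSTRAINTS flag is what makes TensorRT honour the per-layer precision):

```python
import tensorrt as trt


def force_pooling_to_fp16(network, config):
    """Pins every pooling layer (BiSeNet V2's average poolings) to FP16.

    Call this between parsing the ONNX model and building the engine.
    """
    # Without this flag, TensorRT treats per-layer precision settings as hints only.
    config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if layer.type == trt.LayerType.POOLING:
            layer.precision = trt.DataType.HALF
            layer.set_output_type(0, trt.DataType.HALF)
```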
I would be interested to know whether, in your experience with INT8 post-training quantization, you have also encountered situations where some layers cannot be converted. In addition, what would be your approach to still benefit from quantization in such cases?
Thank you very much for your precious insight!
TensorRT version: 8.5.2.2
CUDA version: 11.8
JetPack version: 5.1.1
PyTorch version: 2.1.1
Operating System + version: Ubuntu 20.04
Python version: 3.8.10