Post-Training Quantization (PTQ) for semantic segmentation model running on Jetson Orin NX

Dear community,

In order to optimize a semantic segmentation model running on Jetson Orin NX, I am interested in Post-Training Quantization (PTQ). The model I am working on is a BiSeNet V2 trained on Cityscapes, and I am trying to convert its layers from FP32 to INT8 using TensorRT’s calibration mechanism based on the class trt.IInt8EntropyCalibrator2. I am aiming for full INT8 quantization, but some layers apparently cannot be converted.

So far, I have only obtained partial quantization, with some layers selectively left in FP16 (since they apparently cannot be converted to INT8) and the others successfully converted to INT8, as you can see in the images below generated with “TREx”. Note that I have used the ONNX model simplification library “onnx-simplifier” to increase the number of convertible layers.
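For reference, the simplification step amounts to something like the following (a minimal sketch of how onnx-simplifier can be used; the file names are illustrative, not my exact paths):

    import onnx
    from onnxsim import simplify

    model = onnx.load("bisenetv2.onnx")   # original exported model (illustrative name)
    model_simp, check = simplify(model)   # fold constants, fuse nodes, remove redundant ops
    assert check, "Simplified model could not be validated"
    onnx.save(model_simp, "simplified.onnx")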

As you can see in the second of the two images below, the BiSeNet input is FP32, the output is FP16, and 25 FP16 layers are scattered around the middle of the network:
- Quantized BiSeNet V2:


- Standard BiSeNet V2:

For reference, here are the ONNX model, the command and the function I use for quantization:
- BiSeNet V2 ONNX:
simplified.zip (12.0 MB)
- Command: python production/src/production/bin/conversion/run_quantization.py --int8 --calibration-data=/media/10TBHardDisk25/Datasets/cityscapes_train_for_calib/train/ --calibration-cache=/mnt/erx/caches/cityscapes_train_calib.cache --explicit-batch -m /media/10TBHardDisk25/BiSeNetV2Tests/simplified.onnx.
- Function: build_engine_bisenet() of the Python script below:
build_engine_bisenet.zip (1.7 KB)

This incomplete quantization has an impact on the model’s inference time (5.6 ms for the quantized model vs. 4.6 ms for the standard model). Indeed, not only does the quantized model contain 109 layers versus 86 for the standard model:

But, as you can also see in the figures below generated with “TREx”, this “mixed-precision” approach introduces “Reformat” layers, which result in a higher overall latency than the standard model:


These “Reformat” layers are apparently inserted to handle tensor precision conversions, and they seem to introduce additional latency (extra memory accesses and computational overhead). This increases the overall latency of the model and counterbalances the performance gain expected from quantization.

I believe this incomplete INT8 quantization may be due to the Average Pooling layers, which TensorRT fails to quantize. Indeed, semantic segmentation models usually contain a lot of Average Pooling, and in the case of BiSeNet V2 I had to force the type of these layers to trt.DataType.HALF.
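In essence, that forcing amounts to something like this (a simplified sketch, not my exact code; my script selects the layers by name, whereas here they are matched by layer type for illustration):

    # Simplified sketch: force every pooling layer to FP16.
    # The builder config must also be set to obey/prefer precision constraints for this to take effect.
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if layer.type == trt.LayerType.POOLING:
            layer.precision = trt.DataType.HALF
            for j in range(layer.num_outputs):
                layer.set_output_type(j, trt.DataType.HALF)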

I would be interested to know whether, in your experience with INT8 post-training quantization, you have also encountered situations where some layers cannot be converted. In addition, what would be your approach to still benefit from the quantization stage?

Thank you very much for your precious insight!


TensorRT version: 8.5.2.2
CUDA version: 11.8
Jetpack version: 5.1.1
PyTorch version: 2.1.1
Operating System + version: Ubuntu 20.04
Python version: 3.8.10

Hi,

How do you apply the quantization to the model?
Do you do it manually with the API?

If so, have you tried to do that with trtexec?
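For example, an INT8 engine can usually be built with something along the lines of trtexec --onnx=simplified.onnx --int8 --calib=cityscapes_train_calib.cache --saveEngine=bisenetv2_int8.engine (the file names here are just examples; please check trtexec --help for the exact flags available in your TensorRT version).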

Thanks.

Hello @AastaLLL ,

I don’t use trtexec. I am indeed using the API to quantize the model.

More precisely, I first generate a calibration cache with a function returning an object derived from the class trt.IInt8EntropyCalibrator2 (this is done in the function handle_precision_flags(), as you can see in my script build_engine_bisenet.py shared in the ZIP file above). After that, I build the quantized model with serialized_engine = builder.build_serialized_network(network, config).
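To give an idea of the flow, here is a stripped-down sketch of that approach (the calibrator below is illustrative rather than my exact implementation; it assumes an iterable of preprocessed NCHW float32 batches and uses PyCUDA for the device buffer):

    import numpy as np
    import pycuda.autoinit  # creates a CUDA context
    import pycuda.driver as cuda
    import tensorrt as trt

    class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
        """Feeds preprocessed calibration batches to TensorRT and caches the computed scales."""

        def __init__(self, batches, cache_file):
            trt.IInt8EntropyCalibrator2.__init__(self)
            self.batches = iter(batches)      # iterable of (1, C, H, W) float32 arrays
            self.cache_file = cache_file
            self.device_input = None

        def get_batch_size(self):
            return 1

        def get_batch(self, names):
            try:
                batch = next(self.batches)
            except StopIteration:
                return None                   # no more data: calibration is finished
            if self.device_input is None:
                self.device_input = cuda.mem_alloc(batch.nbytes)
            cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
            return [int(self.device_input)]

        def read_calibration_cache(self):
            try:
                with open(self.cache_file, "rb") as f:
                    return f.read()
            except FileNotFoundError:
                return None

        def write_calibration_cache(self, cache):
            with open(self.cache_file, "wb") as f:
                f.write(cache)

    # ... after parsing the ONNX model into `network` ...
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.INT8)
    config.int8_calibrator = EntropyCalibrator(calibration_batches, "cityscapes_train_calib.cache")
    serialized_engine = builder.build_serialized_network(network, config)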

In my experience, I have already used trtexec, but only for converting ONNX to a “standard” .engine without quantization. If you think using trtexec for quantization is better than using the API, could you please point me to the relevant documentation? Do you refer to “Serialized Engine Generation”?

Thank you very much in advance! 🙏

Hi,

The two approaches are expected to generate the same output.
Based on your description, since there are some non-quantized layers in between, TensorRT needs to add reformat handlers to deal with the precision changes, which can impact the performance.

We need to give it a try to gather more info.
Will keep you updated.

Thanks.

Hi,

Would you mind sharing the runnable building script and analysis source so we can check it further in our environment?

Thanks.

Hi @AastaLLL

Of course, you can find my code here:
2024-12-18_ScriptsSharedWithNVIDIA.zip (20.5 MB)

The script run_quantization.py launches the quantization. This zipped folder also contains the model as well as the analysis report generated with trex.

Don’t hesitate to come back to me if you have any trouble getting the code to run.

Thank you again very much for your support!

Hi,

Thanks for sharing this.
We will give it a try and let you know what we find.

Thanks.

Hi,

Thanks for sharing the script.

While we are checking locally, is it possible for you to upgrade the JetPack version to our latest v6.1?
The latest software usually contains several bug fixes and new features, so the behavior might be different.

Thanks.

Hi,

Thank you for the recommendation, but unfortunately no. That’s precisely the limitation of my setup: I have to work with a Jetson Orin NX installed on a machine that is limited to JetPack v5.1. But yes, I’m very aware that staying on JetPack v5.1 may expose me to many bugs that have since been fixed.

Thank you again for your time.

Hi,

Understood. We need to check this with our internal team and will provide more info later.
Our internal team usually works on issues with the latest release.

Once we get further info (there might be some delay due to the holiday season), we can see whether it is possible to apply it to JetPack 5.1.

Thanks.

Hi @AastaLLL

It is interesting that some layers can’t be converted to INT8 even with the latest version of TensorRT 🤔. Thank you for your feedback and for testing 🙏.

Don’t worry, I won’t be actively working for the next 2 weeks either. I look forward to hearing from you.

Thank you again for your efforts and Merry Christmas!

Hi,

Happy New Year~

We got some feedback from our internal team.
Could you follow the suggestions below and run the experiment?

Given the provided information, it looks like the quantized convolutions run slower than the non-quantized convolutions.
This could be due to (1) this is Orin; (2) TRT 8.5; (3) very small number of channels and BS=1; (4) your script is adding type constraints that are limiting TRT’s kernel decisions.

Please try to avoid all of the code in build_engine_bisenet.py lines 29-99 and just let TRT choose the precisions.
If that does not work, please try to upgrade from TRT 8.5 to a newer TRT version, which hopefully has better kernels.

Thanks.

Hi,

Thank you very much to your internal team for the feedback.

Indeed, the quantized model runs slower than the non-quantized one, most likely due to the introduced “Reformat” layers.

I tested your suggestions and made the following observations:

(1) This is due to Orin:

  • Well, for quicker debugging, I am currently testing my code on x86 and, even on this more performant platform, I am not able to obtain full INT8 quantization. I will move back to Orin only after achieving full quantization on x86.

(2) TRT 8.5:

  • It is quite probable that newer versions of TRT behave differently. For my use case, I am unfortunately forced to stick to version 8.5.2 since I can’t use a JetPack newer than version 5.1.1.

(3) Very small number of channels and BS=1:

  • The number of channels is inherent to the model itself and I don’t want to change it.
  • When testing on x86 with the command python run_quantization.py --int8 --calibration-data=/path/to/cityscapes/calibration/dataset/ --calibration-cache=cityscapes_calib.cache --explicit-batch -m /model/bisenetv2_simplified.onnx, I didn’t specify the batch size and used the “--explicit-batch” option instead. When switching to a specific batch size, for instance with python run_quantization.py --int8 --calibration-data=/path/to/cityscapes/calibration/dataset/ --calibration-cache=cityscapes_calib.cache --max-batch-size=64 -m /model/bisenetv2_simplified.onnx, I get the following error when reaching the line parser.parse(f.read()) that parses the model:
    In node -1 (importModel): INVALID_VALUE: Assertion failed: !_importer_ctx.network()->hasImplicitBatchDimension() && "This version of the ONNX parser only supports TensorRT INetworkDefinitions with an explicit batch dimension. Please ensure the network was created using the EXPLICIT_BATCH NetworkDefinitionCreationFlag."
    
  • Consequently, at least with the version of TensorRT I am using, it seems that I am forced to use the flag --explicit-batch instead of specifying a max_batch_size (a sketch of the corresponding network creation is shown after these observations).

(4) Script adding type constraints limiting TRT’s kernel decisions:

  • When removing the part of the code you suggested in order to let TRT choose the precision of the layers, I get the following error indicating that INT8 cannot be used as the precision type for the specific layer called “/backbone/bga/Concat_output_0”:
    [01/06/2025-17:36:44] [TRT] [E] 3: /backbone/bga/Concat_output_0: cannot use precision Int8 with weights of type Int32
    [01/06/2025-17:36:44] [TRT] [E] 4: [network.cpp::validate::3015] Error Code 4: Internal Error (Layer /backbone/bga/Concat_output_0 failed validation)
    [01/06/2025-17:36:44] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::751] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
    Traceback (most recent call last):
      File "/opt/project/production/src/production/bin/conversion/run_quantization.py", line 513, in build_engine_bisenet
        raise RuntimeError("Failed to build the TensorRT engine.")
    RuntimeError: Failed to build the TensorRT engine.
    
  • This could be solved by explicitly setting the precision of the layer “/backbone/bga/Concat_output_0” to INT32 as follows:
    # Force the problematic Concat layer (and its outputs) to INT32 so that the engine builds.
    layer_name_list = ["/backbone/bga/Concat_output_0"]
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if layer.name in layer_name_list:
            layer.precision = trt.DataType.INT32
            for j in range(layer.num_outputs):
                layer.set_output_type(j, trt.DataType.INT32)
            logging.info(f"Unconvertible layer {layer.name} set to INT32 precision.")
    
  • In fact, this is the only constraint that has to be applied in order to successfully serialize the model 🎉. However, when analyzing the layer types using trex, I get the following statistics:
  • As you can see, barely 75% of the layers have been quantized to INT8, and the remaining ones have FP32 precision.
  • Looking at the picture sent in my initial post, forcing the precision of some layers (as I did initially) could increase the number of INT8-quantized layers:
  • In conclusion, it seems that letting TRT choose the precision types does not result in a complete INT8 quantization.
  • You can find attached the updated script “run_quantization.py”
    run_quantization.zip (5.8 KB)
    (i.e., the one without the precision constraints) and the trex report
    trex_report_without_constraints.zip (3.1 MB)
    summarizing those results.
    • Notes:
      • I observed that adding the line config.set_flag(trt.BuilderFlag.INT8) (now commented out in the script) makes no difference to the results.
      • Using the line config.set_flag(trt.BuilderFlag.FP16) (now commented out in the script) instead gives worse results:
      • The line config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS) (ensuring precision constraints are strictly obeyed) has to be commented out in order to avoid the following issue related to the layer “/Resize_1”, whose output apparently has to be in floating-point precision:
        [01/07/2025-13:12:06] [TRT] [E] 2: [optimizer.cpp::filterPTQFormats::4442] Error Code 2: Internal Error (Assertion !n->candidateRequirements.empty() failed. /Resize_1 requires output to be in floating point precision. No format available with unquantized output. Try setting layer output type using ILayer::setOutputType.)
        [01/07/2025-13:12:06] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::751] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
        
        • To fix the above issue, the layer type has to be constrained to trt.DataType.HALF for a set of layers, as can be seen in the script “run_quantization.py”. This can be verified by setting obey_precision_constraints = True in this script. In this case, constraining the precision, I obtain the following results, which are fairly similar to what I obtained with my very first version of the script that constrained the precision types:

          trex_report_with_constraints.zip (3.3 MB)
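
Regarding point (3) above, here is roughly how the network is created with the EXPLICIT_BATCH flag in my script (a simplified sketch; builder, trt_logger, onnx_path and the error handling are illustrative):

    # With the ONNX parser of TRT 8.5, the network must be created with an explicit batch dimension.
    explicit_batch_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(explicit_batch_flag)
    parser = trt.OnnxParser(network, trt_logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                logging.error(parser.get_error(i))
            raise RuntimeError("Failed to parse the ONNX model.")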

To conclude:

  • Due to the introduction of “Reformat” layers, an incomplete INT8 quantization is not useful, since the resulting quantized network is slower than the standard one. Ideally, I would like a network with all layers quantized to INT8.
  • Based on the new version of my script, which lets TRT 8.5.2.2 choose the precision of the layers, are you aware of any tactics to increase the proportion of INT8 layers and move towards a network fully quantized to INT8?
  • As a next step, out of curiosity, I will try to adapt my script to use a newer version of TRT (namely TRT 10.3.0, compatible with JetPack 6.1) and see whether I can convert more layers to INT8.

Thank you again very much for your help and support!

Hi,

Thanks for your experiments. We have shared these data with our internal team.

About point (3), the very small number of channels and BS=1:
As the channels and batch size are usually determined by the model, this recommendation means using a model with a different architecture that has more channels, or using a larger batch size.

Since our internal team usually deals with issues on the latest release, please help us collect the info with TensorRT 10.3 as well.
After figuring out the root cause, we can check how to apply the fix/WAR (if possible) back to TensorRT 8.5.

Thanks.

Hi @AastaLLL,

Alright, thank you for clarifying point (3).

I will keep you informed about my progress when adapting my script and testing with TRT 10.3.

Best regards.

Hi,

Going forward, we recommend using explicit quantization (Q/DQ), because implicit quantization is deprecated.

Implicit quantization (IQ) works best when TRT is left alone to choose the best kernels, and we test its effectiveness by looking at the performance, not at the number of layers that execute in INT8.
This is because in IQ we give TRT the power to choose which data types and kernels to use for each layer, based on which configuration gives the best performance (more details are in the TRT user guide).

TRT might choose FP32/FP16/BF16 kernels for some layers if it determined that this is the most performant configuration. This can happen for various reasons: (1) the cost of using INT8 is actually higher than using FP32 (because of extra reformatting or because of HW considerations); (2) TRT might have poorly performing INT8 kernels for this configuration; (3) TRT may not have INT8 kernels at all.
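For reference, a rough sketch of how a Q/DQ model can be produced with the pytorch-quantization toolkit before handing it to TRT (the model constructor, input shape, opset and the elided calibration step are placeholders, not a verified recipe):

    import torch
    from pytorch_quantization import quant_modules
    from pytorch_quantization import nn as quant_nn

    quant_modules.initialize()               # replace torch layers with quantized counterparts
    model = build_bisenetv2()                # hypothetical: your own model constructor
    # ... run a calibration pass over a few hundred Cityscapes images to collect ranges ...
    quant_nn.TensorQuantizer.use_fb_fake_quant = True   # export quantizers as Q/DQ ONNX nodes
    dummy = torch.randn(1, 3, 1024, 2048)
    torch.onnx.export(model, dummy, "bisenetv2_qdq.onnx", opset_version=13)
    # The resulting ONNX with Q/DQ nodes can then be built into an INT8 engine without a calibration cache.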

Our internal team will test what’s going on with that failing Concat and also test the performance with the Concat WAR.

Thanks.

Hi,

Did you get it to work with TensorRT 10?
Thanks.

Hi,

Could you share the performance with the Concat WAR?
As mentioned before, the number of layers in INT8 is not a good indicator.

Thanks.