Post-Training Quantization (PTQ) for semantic segmentation model running on Jetson Orin NX

Dear community,

In order to optimize a semantic segmentation model running on Jetson Orin NX, I am interested in Post-Training Quantization (PTQ). The model I am working on is a BiSeNet V2 trained on Cityscapes, and I am trying to convert its layers from FP32 to INT8 using TensorRT’s calibration mechanism based on the class trt.IInt8EntropyCalibrator2. I am aiming for full INT8 quantization, but some layers apparently cannot be converted.

So far, I have only obtained partial quantization, with some layers selectively left in FP16 (since they apparently cannot be converted to INT8) and the others successfully converted to INT8, as you can see in the images below generated with “TREx”. Note that I have used the ONNX model simplification library “onnx-simplifier” to increase the number of convertible layers.
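For reference, the simplification step amounts to something like the following (a minimal sketch of how onnx-simplifier can be used; the file names are illustrative, not my exact paths):

    import onnx
    from onnxsim import simplify

    model = onnx.load("bisenetv2.onnx")   # original exported model (illustrative name)
    model_simp, check = simplify(model)   # fold constants, fuse nodes, remove redundant ops
    assert check, "Simplified model could not be validated"
    onnx.save(model_simp, "simplified.onnx")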

As you can see in the second of the two images below, the BiSeNet input is FP32, the output is FP16, and 25 FP16 layers are scattered around the middle of the network:
- Quantized BiSeNet V2:


- Standard BiSeNet V2:

For reference, here are the ONNX model, the command and the function I use for quantization:
- BiSeNet V2 ONNX:
simplified.zip (12.0 MB)
- Command: python production/src/production/bin/conversion/run_quantization.py --int8 --calibration-data=/media/10TBHardDisk25/Datasets/cityscapes_train_for_calib/train/ --calibration-cache=/mnt/erx/caches/cityscapes_train_calib.cache --explicit-batch -m /media/10TBHardDisk25/BiSeNetV2Tests/simplified.onnx.
- Function: build_engine_bisenet() of the Python script below:
build_engine_bisenet.zip (1.7 KB)

This incomplete quantization has an impact on the model’s inference time (5.6 ms for the quantized model vs. 4.6 ms for the standard model). Indeed, not only does the quantized model contain 109 layers versus 86 for the standard model:

But, as you can also see in the figures below generated with “TREx”, this “mixed-precision” approach introduces “Reformat” layers, which result in a higher overall latency than the standard model:


These “Reformat” layers are apparently inserted to handle tensor precision conversions, and they seem to introduce additional latency (extra memory accesses and computational overhead). This increases the overall latency of the model and counterbalances the performance gain expected from quantization.

I believe this incomplete INT8 quantization may be due to the Average Pooling layers, which TensorRT fails to quantize. Indeed, semantic segmentation models usually contain a lot of Average Pooling, and in the case of BiSeNet V2 I had to force the type of these layers to trt.DataType.HALF.
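In essence, that forcing amounts to something like this (a simplified sketch, not my exact code; my script selects the layers by name, whereas here they are matched by layer type for illustration):

    # Simplified sketch: force every pooling layer to FP16.
    # The builder config must also be set to obey/prefer precision constraints for this to take effect.
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if layer.type == trt.LayerType.POOLING:
            layer.precision = trt.DataType.HALF
            for j in range(layer.num_outputs):
                layer.set_output_type(j, trt.DataType.HALF)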

I would be interested to know whether, in your experience with INT8 post-training quantization, you have also encountered situations where some layers cannot be converted. In addition, what would be your approach to still benefit from the quantization stage?

Thank you very much for your precious insight!


TensorRT version: 8.5.2.2
CUDA version: 11.8
Jetpack version: 5.1.1
PyTorch version: 2.1.1
Operating System + version: Ubuntu 20.04
Python version: 3.8.10

Hi,

How do you apply the quantization to the model?
Do you do it manually with the API?

If so, have you tried to do that with trtexec?
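For example, an INT8 engine can usually be built with something along the lines of trtexec --onnx=simplified.onnx --int8 --calib=cityscapes_train_calib.cache --saveEngine=bisenetv2_int8.engine (the file names here are just examples; please check trtexec --help for the exact flags available in your TensorRT version).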

Thanks.

Hello @AastaLLL ,

I don’t use trtexec. I am indeed using the API to quantize the model.

More precisely, I first generate a calibration cache with a function returning an object derived from the class trt.IInt8EntropyCalibrator2 (this is done in the function handle_precision_flags(), as you can see in my script build_engine_bisenet.py shared in the ZIP file above). After that, I build the quantized model with serialized_engine = builder.build_serialized_network(network, config).
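To give an idea of the flow, here is a stripped-down sketch of that approach (the calibrator below is illustrative rather than my exact implementation; it assumes an iterable of preprocessed NCHW float32 batches and uses PyCUDA for the device buffer):

    import numpy as np
    import pycuda.autoinit  # creates a CUDA context
    import pycuda.driver as cuda
    import tensorrt as trt

    class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
        """Feeds preprocessed calibration batches to TensorRT and caches the computed scales."""

        def __init__(self, batches, cache_file):
            trt.IInt8EntropyCalibrator2.__init__(self)
            self.batches = iter(batches)      # iterable of (1, C, H, W) float32 arrays
            self.cache_file = cache_file
            self.device_input = None

        def get_batch_size(self):
            return 1

        def get_batch(self, names):
            try:
                batch = next(self.batches)
            except StopIteration:
                return None                   # no more data: calibration is finished
            if self.device_input is None:
                self.device_input = cuda.mem_alloc(batch.nbytes)
            cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
            return [int(self.device_input)]

        def read_calibration_cache(self):
            try:
                with open(self.cache_file, "rb") as f:
                    return f.read()
            except FileNotFoundError:
                return None

        def write_calibration_cache(self, cache):
            with open(self.cache_file, "wb") as f:
                f.write(cache)

    # ... after parsing the ONNX model into `network` ...
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.INT8)
    config.int8_calibrator = EntropyCalibrator(calibration_batches, "cityscapes_train_calib.cache")
    serialized_engine = builder.build_serialized_network(network, config)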

In my experience, I have already used trtexec, but only for converting ONNX to a “standard” .engine without quantization. If you think using trtexec for quantization is better than using the API, could you please point me to the relevant documentation? Do you refer to “Serialized Engine Generation”?

Thank you very much in advance! 🙏

Hi,

The two approaches are expected to generate the same output.
Based on your description, since there are some non-quantized layers in between, TensorRT needs to add reformat handlers to deal with the precision changes, which can impact the performance.

We need to give it a try to gather more info.
Will keep you updated.

Thanks.

Hi,

Would you mind sharing the runnable building script and analysis source so we can check it further in our environment?

Thanks.

Hi @AastaLLL

Of course, you can find my code here:
2024-12-18_ScriptsSharedWithNVIDIA.zip (20.5 MB)

The script run_quantization.py launches the quantization. This zipped folder also contains the model as well as the analysis report generated with trex.

Don’t hesitate to come back to me if you have any trouble getting the code to run.

Thank you again very much for your support!

Hi,

Thanks for sharing this.
We will give it a try and let you know what we find.

Thanks.

Hi,

Thanks for sharing the script.

While we are checking locally, is it possible for you to upgrade the JetPack version to our latest v6.1?
The latest software usually contains several bug fixes and new features, so the behavior might be different.

Thanks.

Hi,

Thank you for the recommendation, but unfortunately no. That’s precisely the limitation of my setup: I have to work with a Jetson Orin NX installed on a machine that is limited to JetPack v5.1. But yes, I’m very aware that staying on JetPack v5.1 may expose me to many bugs that have since been fixed.

Thank you again for your time.

Hi,

Understood. We need to check this with our internal team and will provide more info later.
Our internal team usually works on issues with the latest release.

Once we get further info (there might be some delay due to the holiday season), we can see whether it is possible to apply it to JetPack 5.1.

Thanks.

Hi @AastaLLL

It is interesting that some layers can’t be converted to INT8 even with the latest version of TensorRT 🤔. Thank you for your feedback and for testing 🙏.

Don’t worry, I won’t be actively working for the next 2 weeks either. I look forward to hearing from you.

Thank you again for your efforts and Merry Christmas!

Hi,

Happy New Year~

We got some feedback from our internal team.
Could you follow the suggestions below and run the experiment?

Given the provided information, it looks like the quantized convolutions run slower than the non-quantized convolutions.
This could be due to (1) this is Orin; (2) TRT 8.5; (3) very small number of channels and BS=1; (4) your script is adding type constraints that are limiting TRT’s kernel decisions.

Please try to avoid all of the code in build_engine_bisenet.py lines 29-99 and just let TRT choose the precisions.
If that does not work, please try to upgrade from TRT 8.5 to a newer TRT version, which hopefully has better kernels.

Thanks.

Hi,

Thank you very much to your internal team for the feedback.

Indeed, the quantized model runs slower than the non-quantized one, most likely due to the introduced “Reformat” layers.

I tested your suggestions and made the following observations:

(1) This is due to Orin:

  • Well, for quicker debugging, I am currently testing my code on x86 and, even on this more performant platform, I am not able to obtain full INT8 quantization. I will move back to Orin only after achieving full quantization on x86.

(2) TRT 8.5:

  • It is quite probable that newer versions of TRT behave differently. For my use case, I am unfortunately forced to stick to version 8.5.2 since I can’t use a JetPack newer than version 5.1.1.

(3) Very small number of channels and BS=1:

  • The number of channels is inherent to the model itself and I don’t want to change it.
  • When testing on x86 with the command python run_quantization.py --int8 --calibration-data=/path/to/cityscapes/calibration/dataset/ --calibration-cache=cityscapes_calib.cache --explicit-batch -m /model/bisenetv2_simplified.onnx, I didn’t specify the batch size and used the “--explicit-batch” option instead. When switching to a specific batch size, for instance with python run_quantization.py --int8 --calibration-data=/path/to/cityscapes/calibration/dataset/ --calibration-cache=cityscapes_calib.cache --max-batch-size=64 -m /model/bisenetv2_simplified.onnx, I get the following error when reaching the line parser.parse(f.read()) that parses the model:
    In node -1 (importModel): INVALID_VALUE: Assertion failed: !_importer_ctx.network()->hasImplicitBatchDimension() && "This version of the ONNX parser only supports TensorRT INetworkDefinitions with an explicit batch dimension. Please ensure the network was created using the EXPLICIT_BATCH NetworkDefinitionCreationFlag."
    
  • Consequently, at least with the version of TensorRT I am using, it seems that I am forced to use the flag --explicit-batch instead of specifying a max_batch_size (a sketch of the corresponding network creation is shown after these observations).

(4) Script adding type constraints limiting TRT’s kernel decisions:

  • When removing the part of the code you suggested in order to let TRT choose the precision of the layers, I get the following error indicating that INT8 cannot be used as the precision type for the specific layer called “/backbone/bga/Concat_output_0”:
    [01/06/2025-17:36:44] [TRT] [E] 3: /backbone/bga/Concat_output_0: cannot use precision Int8 with weights of type Int32
    [01/06/2025-17:36:44] [TRT] [E] 4: [network.cpp::validate::3015] Error Code 4: Internal Error (Layer /backbone/bga/Concat_output_0 failed validation)
    [01/06/2025-17:36:44] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::751] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
    Traceback (most recent call last):
      File "/opt/project/production/src/production/bin/conversion/run_quantization.py", line 513, in build_engine_bisenet
        raise RuntimeError("Failed to build the TensorRT engine.")
    RuntimeError: Failed to build the TensorRT engine.
    
  • This could be solved by explicitly setting the precision of the layer “/backbone/bga/Concat_output_0” to INT32 as follows:
    # Force the problematic Concat layer (and its outputs) to INT32 so that the engine builds.
    layer_name_list = ["/backbone/bga/Concat_output_0"]
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if layer.name in layer_name_list:
            layer.precision = trt.DataType.INT32
            for j in range(layer.num_outputs):
                layer.set_output_type(j, trt.DataType.INT32)
            logging.info(f"Unconvertible layer {layer.name} set to INT32 precision.")
    
  • In fact, this is the only constraint that has to be applied in order to successfully serialize the model 🎉. However, when analyzing the layer types using trex, I get the following statistics:
  • As you can see, barely 75% of the layers have been quantized to INT8, and the remaining ones have FP32 precision.
  • Looking at the picture sent in my initial post, forcing the precision of some layers (as I did initially) could increase the number of INT8-quantized layers:
  • In conclusion, it seems that letting TRT choose the precision types does not result in a complete INT8 quantization.
  • You can find attached the updated script “run_quantization.py”
    run_quantization.zip (5.8 KB)
    (i.e., the one without the precision constraints) and the trex report
    trex_report_without_constraints.zip (3.1 MB)
    summarizing those results.
    • Notes:
      • I observed that adding the line config.set_flag(trt.BuilderFlag.INT8) (now commented out in the script) makes no difference to the results.
      • Using the line config.set_flag(trt.BuilderFlag.FP16) (now commented out in the script) instead gives worse results:
      • The line config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS) (ensuring precision constraints are strictly obeyed) has to be commented out in order to avoid the following issue related to the layer “/Resize_1”, whose output apparently has to be in floating-point precision:
        [01/07/2025-13:12:06] [TRT] [E] 2: [optimizer.cpp::filterPTQFormats::4442] Error Code 2: Internal Error (Assertion !n->candidateRequirements.empty() failed. /Resize_1 requires output to be in floating point precision. No format available with unquantized output. Try setting layer output type using ILayer::setOutputType.)
        [01/07/2025-13:12:06] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::751] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
        
        • To fix the above issue, the layer type has to be constrained to trt.DataType.HALF for a set of layers, as can be seen in the script “run_quantization.py”. This can be verified by setting obey_precision_constraints = True in this script. In this case, constraining the precision, I obtain the following results, which are fairly similar to what I obtained with my very first version of the script that constrained the precision types:

          trex_report_with_constraints.zip (3.3 MB)
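
Regarding point (3) above, here is roughly how the network is created with the EXPLICIT_BATCH flag in my script (a simplified sketch; builder, trt_logger, onnx_path and the error handling are illustrative):

    # With the ONNX parser of TRT 8.5, the network must be created with an explicit batch dimension.
    explicit_batch_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(explicit_batch_flag)
    parser = trt.OnnxParser(network, trt_logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                logging.error(parser.get_error(i))
            raise RuntimeError("Failed to parse the ONNX model.")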

To conclude:

  • Due to the introduction of “Reformat” layers, an incomplete INT8 quantization is not useful, since the resulting quantized network is slower than the standard one. Ideally, I would like a network with all layers quantized to INT8.
  • Based on the new version of my script, which lets TRT 8.5.2.2 choose the precision of the layers, are you aware of any tactics to increase the proportion of INT8 layers and move towards a network fully quantized to INT8?
  • As a next step, out of curiosity, I will try to adapt my script to use a newer version of TRT (namely TRT 10.3.0, compatible with JetPack 6.1) and see whether I can convert more layers to INT8.

Thank you again very much for your help and support!

Hi,

Thanks for your experiments. We have shared these data with our internal team.

About point (3), the very small number of channels and BS=1:
As the channels and batch size are usually determined by the model, this recommendation means using a model with a different architecture that has more channels, or using a larger batch size.

Since our internal team usually deals with issues on the latest release, please help us collect the info with TensorRT 10.3 as well.
After figuring out the root cause, we can check how to apply the fix/WAR (if possible) back to TensorRT 8.5.

Thanks.

Hi @AastaLLL,

Alright, thank you for clarifying point (3).

I will keep you informed about my progress when adapting my script and testing with TRT 10.3.

Best regards.

Hi,

Going forward, we recommend using explicit quantization (Q/DQ), because implicit quantization is deprecated.

Implicit quantization (IQ) works best when TRT is left alone to choose the best kernels, and we test its effectiveness by looking at the performance, not at the number of layers that execute in INT8.
This is because in IQ we give TRT the power to choose which data types and kernels to use for each layer, based on which configuration gives the best performance (more details are in the TRT user guide).

TRT might choose FP32/FP16/BF16 kernels for some layers if it determined that this is the most performant configuration. This can happen for various reasons: (1) the cost of using INT8 is actually higher than using FP32 (because of extra reformatting or because of HW considerations); (2) TRT might have poorly performing INT8 kernels for this configuration; (3) TRT may not have INT8 kernels at all.
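For reference, a rough sketch of how a Q/DQ model can be produced with the pytorch-quantization toolkit before handing it to TRT (the model constructor, input shape, opset and the elided calibration step are placeholders, not a verified recipe):

    import torch
    from pytorch_quantization import quant_modules
    from pytorch_quantization import nn as quant_nn

    quant_modules.initialize()               # replace torch layers with quantized counterparts
    model = build_bisenetv2()                # hypothetical: your own model constructor
    # ... run a calibration pass over a few hundred Cityscapes images to collect ranges ...
    quant_nn.TensorQuantizer.use_fb_fake_quant = True   # export quantizers as Q/DQ ONNX nodes
    dummy = torch.randn(1, 3, 1024, 2048)
    torch.onnx.export(model, dummy, "bisenetv2_qdq.onnx", opset_version=13)
    # The resulting ONNX with Q/DQ nodes can then be built into an INT8 engine without a calibration cache.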

Our internal team will test what’s going on with that failing Concat and also test the performance with the Concat WAR.

Thanks.

Hi,

Did you get it to work with TensorRT 10?
Thanks.

Hi,

Could you share the performance with the Concat WAR?
As mentioned before, the number of layers in INT8 is not a good indicator.

Thanks.