Post-Training Quantization (PTQ) for semantic segmentation model running on Jetson Orin NX

Hi,

Have you been able to get the data for TensorRT 10.3?

We had some discussions with our internal team and we feel the TensorRT behavior is expected.
TensorRT is optimized based on performance.
So if a layer runs faster on other data types, TensorRT will choose it instead of int8.

Thanks.

Hello @AastaLLL,

I am sorry for the late reply.

Thank you for the clarification that TRT is optimized for performance; that makes sense.

Here is some feedback on my experiments:


Experiment 1

I finally tested and compared the two quantization experiments based on TRT 8.5.2 (with and without data type constraints):

A) With constraints:

  • As you can see in the provided script
    run_quantization.zip (5.8 KB)
    , here is the list of layers whose data type I had to force to HALF in order to avoid Error Code 10: Internal Error (Could not find any implementation for node ...) (see the sketch after this list):
    • “/Resize_1”,
    • “/backbone/semantic/stage1/pool/MaxPool”
    • “/backbone/semantic/stage2/0/dwconv/0/conv/Conv”
    • “/backbone/bga/detail_down/1/AveragePool”
    • “/backbone/semantic/stage2/1/dwconv/0/conv/Conv”
    • “/backbone/semantic/stage2/1/dwconv/0/activate/Relu”
    • “/backbone/semantic/stage3/0/dwconv/0/conv/Conv”
    • “/backbone/semantic/stage3/1/dwconv/0/conv/Conv”
    • “/backbone/semantic/stage3/1/dwconv/0/activate/Relu”
    • “/backbone/semantic/stage4/0/dwconv/0/conv/Conv”
    • “/backbone/semantic/stage4/1/dwconv/0/conv/Conv”
    • “/backbone/semantic/stage4/1/dwconv/0/activate/Relu”
    • “/backbone/semantic/stage4/2/dwconv/0/conv/Conv”
    • “/backbone/semantic/stage4/2/dwconv/0/activate/Relu”
    • “/backbone/semantic/stage4/3/dwconv/0/conv/Conv”
    • “/backbone/semantic/stage4/3/dwconv/0/activate/Relu”
    • “/backbone/semantic/stage4_CEBlock/gap/0/GlobalAveragePool”
  • Precision statistics: ~ 3/4 INT8, ~ 1/4 FP16, 1 layer in FP32.

  • Complete trex report here:
    sima_simplified_with_constraints.engine.zip (3.3 MB)
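
For clarity, forcing these layers to HALF is done roughly as follows. This is a minimal sketch, not the exact code of the attached script; it assumes the usual network and config objects coming from the ONNX parser and builder, and the layer list is truncated:

    import tensorrt as trt

    # Layers forced to HALF (truncated; the full list is in the attached script).
    HALF_LAYERS = {
        "/Resize_1",
        "/backbone/semantic/stage1/pool/MaxPool",
        "/backbone/semantic/stage2/0/dwconv/0/conv/Conv",
        # ... remaining layers from the list above
    }

    def force_half_precision(network, config):
        """Pin the problematic layers to FP16 and ask the builder to respect
        the per-layer precisions (TRT 8.5 uses the STRICT_TYPES flag for this)."""
        for i in range(network.num_layers):
            layer = network.get_layer(i)
            if layer.name in HALF_LAYERS:
                layer.precision = trt.float16
                layer.set_output_type(0, trt.float16)
        config.set_flag(trt.BuilderFlag.STRICT_TYPES)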

B) Without constraints:

In both cases, I used the command python run_quantization.py --int8 --verbose=DEBUG --calibration-data=/path/to/cityscapes/calibration/dataset/ --calibration-cache=cityscapes_calib.cache --explicit-batch -m /model/bisenetv2_simplified.onnx.

Comparison of the two engines, made with the trex notebook “compare_engines.ipynb”:

Conclusion:

  • The data type of layer “/backbone/bga/Concat_output_0” has to be set to INT32 in any case; otherwise the script cannot continue building the engine.
  • Letting TensorRT choose the layer data types does not yield a fully INT8-quantized engine. Forcing some layers to HALF obviously does not lead to full quantization either, but with constraints the number of INT8 layers tends to increase slightly.
  • Even though the constrained engine has more INT8 layers, its overall latency is higher than that of the unconstrained engine. Constraining the data types also introduces more “Reformat” layers, which contributes to the higher latency.

Experiment 2

I also investigated TRT 10.3.0. For this experiment, I reused the exact same script written for TRT 8.5.2 and only slightly adapted a few parts to make it compatible with TRT 10.3.0. Here are the two adapted snippets, which you can find in the provided script
run_quantization.zip (6.8 KB)
as well:

  • Snippet 1:
    #"strict_types": trt.BuilderFlag.STRICT_TYPES,
    # --- Adaptation for TensorRT 10.3.0
    "strict_types": trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS,
    
  • Snippet 2:
    #config.max_workspace_size = 2**30  # 1GiB
    # --- Adaptation for TensorRT 10.3.0
    config.set_memory_pool_limit(
        trt.MemoryPoolType.WORKSPACE, 2**30
    )
    

Here again, I used the command python run_quantization.py --int8 --verbose=DEBUG --calibration-data=/path/to/cityscapes/calibration/dataset/ --calibration-cache=cityscapes_calib.cache --explicit-batch -m /model/bisenetv2_simplified.onnx.
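
For clarity, here is roughly how the two adaptations above fit into the builder configuration. This is a sketch with a hypothetical create_config helper, not the exact code of the attached script (which keeps the flags in a dictionary instead):

    import tensorrt as trt

    def create_config(builder):
        """Hypothetical helper: builder configuration working for both TRT 8.5.x and 10.x."""
        config = builder.create_builder_config()
        config.set_flag(trt.BuilderFlag.INT8)

        if int(trt.__version__.split(".")[0]) >= 10:
            # TRT 10.x: STRICT_TYPES and max_workspace_size no longer exist.
            config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)
            config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 2**30)  # 1 GiB
        else:
            # TRT 8.5.x path.
            config.set_flag(trt.BuilderFlag.STRICT_TYPES)
            config.max_workspace_size = 2**30  # 1 GiB
        return config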

Unfortunately, with TRT 10.3.0, I couldn’t let the program choose the best data types. In fact, I had to successively force the data type of some layers to avoid Error Code 10: Internal Error (Could not find any implementation for node .... This way, I could bypass the issue by setting the data type of layers “/backbone/detail/detail_branch.2/0/conv/Conv”, “/backbone/semantic/stage1/convs/0/conv/Conv” and “/backbone/semantic/stage1/convs/1/conv/Conv” to BF16, as you can see in the provided script. Sadly, when reaching layer “/backbone/semantic/stage1/pool/MaxPool”, I individually tested each data type listed in the documentation here, but always got the following error when trying to build the quantized engine:

[TRT] [E] IBuilder::buildSerializedNetwork: Error Code 10: Internal Error (Could not find any implementation for node /backbone/semantic/stage1/pool/MaxPool.)

Conclusion: Whereas TRT 8.5.2 has no issue with this layer, TRT 10.3.0 apparently cannot find any implementation for the MaxPool layer.
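
For reference, the per-layer BF16 forcing mentioned above looks roughly like this (a sketch, not the exact code of the attached script; it assumes the BF16 builder flag is also enabled on the config):

    import tensorrt as trt

    # Layers forced to BF16 under TRT 10.3.0 (names taken from the list above).
    BF16_LAYERS = {
        "/backbone/detail/detail_branch.2/0/conv/Conv",
        "/backbone/semantic/stage1/convs/0/conv/Conv",
        "/backbone/semantic/stage1/convs/1/conv/Conv",
    }

    def force_bf16(network):
        """Pin the listed layers to BF16 so the builder no longer fails on those nodes
        (assumes trt.BuilderFlag.BF16 is also set on the builder config)."""
        for i in range(network.num_layers):
            layer = network.get_layer(i)
            if layer.name in BF16_LAYERS:
                layer.precision = trt.DataType.BF16
                layer.set_output_type(0, trt.DataType.BF16)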


Experiment 3

As you recommended, and since implicit quantization is deprecated, I finally also tried explicit quantization by keeping the same script and only adding the --explicit-precision flag to the command (as far as I can tell, this is the only way to do explicit quantization without changing my current script). The command I used here was python run_quantization.py --int8 --verbose=DEBUG --calibration-data=/path/to/cityscapes/calibration/dataset/ --calibration-cache=cityscapes_calib.cache --explicit-batch --explicit-precision -m /model/bisenetv2_simplified.onnx.

I tested this only with TRT 8.5.2 and without type constraints. However, I noticed that I again had to set layer “/backbone/bga/Concat_output_0” to INT32; this always seems to be necessary in order to avoid the following issue:

[TRT] [E] 3: /backbone/bga/Concat_output_0: cannot use precision Int8 with weights of type Int32
[TRT] [E] 4: [network.cpp::validate::3015] Error Code 4: Internal Error (Layer /backbone/bga/Concat_output_0 failed validation)

This way, I could generate an engine, but I observed no difference between the engines built with and without the --explicit-precision flag:


Conclusion: With TRT 10.3.0, using the --explicit-precision flag does not work since EXPLICIT_PRECISION is deprecated. In fact, TensorRT now automatically infers precision based on the provided calibration scales and user-defined precision constraints. Alternatively, I tried explicit quantization with TRT 10.3.0 by specifying scales for all tensors in the network. Unfortunately, this did not help to generate an engine model.
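
For reference, “specifying scales for all tensors” was done roughly along these lines. This is a sketch; amax_per_tensor is a hypothetical mapping from tensor name to calibration amax, and set_dynamic_range is deprecated in TRT 10 but still available:

    import tensorrt as trt

    def set_explicit_scales(network, amax_per_tensor):
        """Attach an explicit INT8 dynamic range to every network tensor whose
        calibration amax is known (amax_per_tensor: hypothetical name -> amax map)."""
        def apply(tensor):
            amax = amax_per_tensor.get(tensor.name)
            if amax is not None:
                # Equivalent to an INT8 scale of amax / 127 for that tensor.
                tensor.set_dynamic_range(-amax, amax)

        for i in range(network.num_inputs):
            apply(network.get_input(i))
        for i in range(network.num_layers):
            layer = network.get_layer(i)
            for j in range(layer.num_outputs):
                apply(layer.get_output(j))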


Overall conclusion

  • It seems that intentionally constraining the data type of some layers is necessary to build the engine at all (at least 1 layer for TRT 8.5.2, and apparently at least 4 for TRT 10.3.0, where the build got stuck at layer “/backbone/semantic/stage1/pool/MaxPool”).
  • Engines built with constraints have slightly more INT8 layers but higher latency (due to the introduced “Reformat” layers).
  • Explicit quantization (at least via the --explicit-precision flag) did not bring any improvement in terms of quantization.

To further my experiments and test the limits of complete INT8 quantization with TensorRT, I will now explore two options:

  1. Run the same script on a much simpler model architecture.
  2. Change strategy and leverage TensorRT’s PyTorch Quantization Toolkit (pytorch-quantization, whose documentation is also available here) to perform explicit quantization; a sketch of this workflow follows below.
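
As a starting point for option 2, the workflow would look roughly like this. This is a sketch based on the toolkit documentation; build_bisenetv2, calibration_loader and the input shape are placeholders, not code from my scripts:

    import torch
    from pytorch_quantization import nn as quant_nn
    from pytorch_quantization import quant_modules

    # Monkey-patch torch.nn so Conv/Linear layers are created with fake quantizers attached.
    quant_modules.initialize()
    model = build_bisenetv2()               # placeholder for the actual model constructor
    model.eval()

    # 1) Calibration: collect activation statistics on a few calibration batches.
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            module.disable_quant()
            module.enable_calib()
    with torch.no_grad():
        for images, _ in calibration_loader:  # placeholder DataLoader over the calibration set
            model(images)

    # 2) Load the computed amax values and switch quantization back on.
    for module in model.modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            module.load_calib_amax()
            module.disable_calib()
            module.enable_quant()

    # 3) Export ONNX with QuantizeLinear/DequantizeLinear nodes, which TensorRT
    #    builds in explicit-quantization mode (no calibration cache needed anymore).
    quant_nn.TensorQuantizer.use_fb_fake_quant = True
    dummy = torch.randn(1, 3, 512, 1024)    # input shape is an assumption
    torch.onnx.export(model, dummy, "bisenetv2_qdq.onnx", opset_version=13)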

I hope this feedback will also be useful to you. Thank you again for your help and patience!

Note: In this reply, I only attached the quantization scripts themselves. The other helper and calibrator scripts remain unchanged compared to what I sent in my previous posts above.

Hi,

Thanks a lot for sharing the experiments.

In the simplified.onnx model shared earlier in this topic, the layer /backbone/bga/Concat_output_0 is the size input of the two Resize layers (/backbone/bga/Resize and /backbone/bga/Resize_1).
So this should be a constant layer.

For a constant layer, maybe you can try to fold the constant to see if it helps.
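
For example, with ONNX GraphSurgeon this is roughly the following (a minimal sketch; the file names are placeholders, and polygraphy surgeon sanitize --fold-constants achieves the same):

    import onnx
    import onnx_graphsurgeon as gs

    graph = gs.import_onnx(onnx.load("bisenetv2_simplified.onnx"))

    # Fold constant subgraphs (for example, the shape/size computation feeding the
    # two Resize layers) and remove the nodes that become dead afterwards.
    graph.fold_constants().cleanup()

    onnx.save(gs.export_onnx(graph), "bisenetv2_simplified_folded.onnx")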

Thanks.

Hi @AastaLLL,

Here is some feedback on my recent experiments with TensorRT 8.5.2.2 (not 10.3.0):

A) Constant folding:

Thanks a lot for suggesting constant folding and for pointing out that the layer “/backbone/bga/Concat_output_0” is the size input of the two Resize layers (“/backbone/bga/Resize” and “/backbone/bga/Resize_1”).

However, I did apply constant folding as described at “GitHub - NVIDIA/TensorRT - tools/onnx-graphsurgeon/examples/05_folding_constants” in order to simplify my ONNX model, but when proceeding to the INT8 quantization step, I still get the same issue saying that the layer “/backbone/bga/Concat_output_0” must have precision type Int32 and cannot use precision Int8:

[TRT] [E] 3: /backbone/bga/Concat_output_0: cannot use precision Int8 with weights of type Int32
[TRT] [E] 4: [network.cpp::validate::3015] Error Code 4: Internal Error (Layer /backbone/bga/Concat_output_0 failed validation)

By the way, I am able to see this layer “/backbone/bga/Concat_output_0” in the ONNX file, but not in the generated .engine file (successfully generated when the precision type of “/backbone/bga/Concat_output_0” is forced to INT32). Do you have any idea why this is the case? Has this layer been merged with other layers during engine building? For reference, here 2025-03-12_ONNXLayerNames.zip (813 Bytes) is the list of layers in the ONNX model and here 2024-12-09_OriginalLayerNames.zip (1015 Bytes) is the list of layers in the .engine model (which can also be visualized from the trex report in the figure “% Latency Budget Per Layer” here LatencyBudgetPerLayer_WithLayerNames_1.zip (1.1 MB)).
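
As a side note, here is one way to dump the engine-side layer names for such comparisons (a sketch using the engine inspector; this is not necessarily how trex collects its data, and fused engine layers typically absorb several ONNX nodes):

    import tensorrt as trt

    def dump_engine_layers(engine_path):
        """Return the per-layer information of a serialized engine as JSON, which shows
        how ONNX nodes were fused or renamed by the builder (building the engine with
        ProfilingVerbosity.DETAILED gives the most complete names)."""
        logger = trt.Logger(trt.Logger.WARNING)
        runtime = trt.Runtime(logger)
        with open(engine_path, "rb") as f:
            engine = runtime.deserialize_cuda_engine(f.read())
        inspector = engine.create_engine_inspector()
        return inspector.get_engine_information(trt.LayerInformationFormat.JSON)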

B) Int8-quantization of vanilla model:

I used the exact same script and procedure to try to quantize a much smaller and simpler model that I call “TinySegNet”. The purpose here was to see whether I could at least fully quantize a basic model. This simplistic model is not designed for performance and only contains 3 convolutional layers followed by 1 upsampling layer at the end:
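
For reference, the architecture is roughly the following (a sketch; the exact channel widths, strides and class count are placeholders and may differ from my actual scripts):

    import torch
    import torch.nn as nn

    class TinySegNet(nn.Module):
        """Roughly the tested architecture: 3 convolutions followed by one upsampling layer."""
        def __init__(self, num_classes=19):  # 19 classes for Cityscapes (assumption)
            super().__init__()
            self.conv1 = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
            self.conv2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
            self.conv3 = nn.Conv2d(32, num_classes, kernel_size=1)
            self.up = nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False)

        def forward(self, x):
            return self.up(self.conv3(self.conv2(self.conv1(x))))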

Unfortunately, full INT8 quantization is apparently not possible here either:


Now, I would like to move away from the BiSeNet V2 model, and I am wondering if, by any chance, you could provide me with a tiny semantic segmentation model for which you can show that full INT8 quantization with TensorRT is effectively possible. Based on such an example where PTQ works, I could visualize the layer precision distribution with trex and build upon it to test and adapt more complex models. Would it be possible to share such a working example with me (ONNX + method)?

Thank you again very much for your help!

Best regards
