Unable to build model engine for INT8 YOLOv8m quantized using TensorRT Model Optimizer

Description

• Hardware Platform (Jetson / GPU) - Jetson Orin AGX 64 GB Developer Kit
• DeepStream Version - Docker Container - deepstream:7.0-triton-multiarch
• JetPack Version (valid for Jetson only) - 6.0
• TensorRT Version - 10.1+
• NVIDIA GPU Driver Version (valid for GPU only) -
• Issue Type (questions, new requirements, bugs) - Question

Hi,

Since implicit quantization has been deprecated after TensorRT 10.1 and there are no end-to-end examples available for explicit quantization, I tried quantizing the model with the TensorRT Model Optimizer library.

Link to the repo: https://github.com/NVIDIA/TensorRT-Model-Optimizer

And I used this snippet to quantize:

python -m modelopt.onnx.quantization \
    --onnx_path=model.onnx \
    --quantize_mode=int8 \
    --calibration_data=calib.npy \
    --calibration_method=minimax \
    --output_path=quant.onnx
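
For reference, calib.npy is just a NumPy array of preprocessed network inputs saved to disk. A minimal sketch of how such a file can be produced is below (the image directory, sample count, and preprocessing details are illustrative placeholders, not our exact pipeline):

import glob

import cv2
import numpy as np

# Collect a few hundred representative frames for calibration.
image_paths = sorted(glob.glob("calib_images/*.jpg"))[:512]

samples = []
for path in image_paths:
    img = cv2.imread(path)                    # BGR, HxWx3, uint8
    img = cv2.resize(img, (640, 640))         # match the 640x640 model input
    img = img[:, :, ::-1].transpose(2, 0, 1)  # BGR -> RGB, HWC -> CHW
    samples.append(img.astype(np.float32) / 255.0)

# Shape (N, 3, 640, 640), matching the model's "input" tensor.
np.save("calib.npy", np.stack(samples))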

But trtexec is unable to build an engine for this INT8 model and throws Error Code 4, stating that the builder could not be configured.

Can someone please let me know whether there are any examples of explicit quantization or of resolving this error, and whether this is the right way to quantize the model?

Hi @ashutoshpanda2002, can you please share detailed logs, the model, and the repro steps/scripts?

Steps we followed to quantize the model: https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/onnx_ptq

Output of the above code was:

/usr/local/lib/python3.10/dist-packages/modelopt/onnx/quantization/int4.py:29: UserWarning: Using slower INT4 ONNX quantization using numpy. Install JAX (Installation — JAX documentation) for faster quantization: No module named 'jax'
warnings.warn(
Loading extension modelopt_round_and_pack_ext…

INFO:root:Model last.onnx with opset_version 16 is loaded.
INFO:root:Quantization Mode: int8
INFO:root:Quantizable op types in the model: ['Add', 'Mul', 'MaxPool', 'Conv']
INFO:root:Building non-residual Add input map …
INFO:root:Searching for hard-coded patterns like MHA, LayerNorm, etc. to avoid quantization.
INFO:root:Building KGEN/CASK targeted partitions …
INFO:root:Classifying the partition nodes …
INFO:root:Total number of nodes: 526
INFO:root:Skipped node count: 0
WARNING:root:Please consider to run pre-processing before quantization. Refer to example: onnxruntime-inference-examples/quantization/image_classification/cpu/ReadMe.md at main · microsoft/onnxruntime-inference-examples · GitHub
/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py:69: UserWarning: Specified provider 'CUDAExecutionProvider' is not in available provider names.Available providers: 'AzureExecutionProvider, CPUExecutionProvider'
warnings.warn(
WARNING:root:Please consider pre-processing before quantization. See onnxruntime-inference-examples/quantization/image_classification/cpu/ReadMe.md at main · microsoft/onnxruntime-inference-examples · GitHub
INFO:root:Deleting QDQ nodes from marked inputs to make certain operations fusible …
INFO:root:Quantized onnx model is saved as quant.onnx
INFO:root:Total number of quantized nodes: 107
INFO:root:Quantized node types: {'Add', 'Concat', 'MaxPool', 'Conv'}
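
(For reference, the pre-processing that the warnings above suggest is ONNX Runtime's model pre-processing pass from the linked ReadMe. Assuming it applies to this model, it would be run on the ONNX file before quantization with a command along these lines; shown here only for context:)

python -m onnxruntime.quantization.preprocess --input last.onnx --output last_preprocessed.onnx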

After this we used trtexec to build the model engine:


root@ubuntu:/usr/src/tensorrt/bin# trtexec --onnx=/opt/nvidia/deepstream/deepstream-7.0/samples/models/smart_warehouse/quant.onnx \
--saveEngine=/opt/nvidia/deepstream/deepstream-7.0/samples/models/smart_warehouse/quant_3.engine \
--minShapes="input":1x3x640x640 --optShapes="input":3x3x640x640 --maxShapes="input":3x3x640x640&
[1] 179
root@ubuntu:/usr/src/tensorrt/bin# &&&& RUNNING TensorRT.trtexec [TensorRT v8602] # trtexec --onnx=/opt/nvidia/deepstream/deepstream-7.0/samples/models/smart_warehouse/quant.onnx --saveEngine=/opt/nvidia/deepstream/deepstream-7.0/samples/models/smart_warehouse/quant_3.engine --minShapes=input:1x3x640x640 --optShapes=input:3x3x640x640 --maxShapes=input:3x3x640x640
[08/31/2024-06:58:40] [I] === Model Options ===
[08/31/2024-06:58:40] [I] Format: ONNX
[08/31/2024-06:58:40] [I] Model: /opt/nvidia/deepstream/deepstream-7.0/samples/models/smart_warehouse/quant.onnx
[08/31/2024-06:58:40] [I] Output:
[08/31/2024-06:58:40] [I] === Build Options ===
[08/31/2024-06:58:40] [I] Max batch: explicit batch
[08/31/2024-06:58:40] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[08/31/2024-06:58:40] [I] minTiming: 1
[08/31/2024-06:58:40] [I] avgTiming: 8
[08/31/2024-06:58:40] [I] Precision: FP32
[08/31/2024-06:58:40] [I] LayerPrecisions: 
[08/31/2024-06:58:40] [I] Layer Device Types: 
[08/31/2024-06:58:40] [I] Calibration: 
[08/31/2024-06:58:40] [I] Refit: Disabled
[08/31/2024-06:58:40] [I] Version Compatible: Disabled
[08/31/2024-06:58:40] [I] ONNX Native InstanceNorm: Disabled
[08/31/2024-06:58:40] [I] TensorRT runtime: full
[08/31/2024-06:58:40] [I] Lean DLL Path: 
[08/31/2024-06:58:40] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[08/31/2024-06:58:40] [I] Exclude Lean Runtime: Disabled
[08/31/2024-06:58:40] [I] Sparsity: Disabled
[08/31/2024-06:58:40] [I] Safe mode: Disabled
[08/31/2024-06:58:40] [I] Build DLA standalone loadable: Disabled
[08/31/2024-06:58:40] [I] Allow GPU fallback for DLA: Disabled
[08/31/2024-06:58:40] [I] DirectIO mode: Disabled
[08/31/2024-06:58:40] [I] Restricted mode: Disabled
[08/31/2024-06:58:40] [I] Skip inference: Disabled
[08/31/2024-06:58:40] [I] Save engine: /opt/nvidia/deepstream/deepstream-7.0/samples/models/smart_warehouse/quant_3.engine
[08/31/2024-06:58:40] [I] Load engine: 
[08/31/2024-06:58:40] [I] Profiling verbosity: 0
[08/31/2024-06:58:40] [I] Tactic sources: Using default tactic sources
[08/31/2024-06:58:40] [I] timingCacheMode: local
[08/31/2024-06:58:40] [I] timingCacheFile: 
[08/31/2024-06:58:40] [I] Heuristic: Disabled
[08/31/2024-06:58:40] [I] Preview Features: Use default preview flags.
[08/31/2024-06:58:40] [I] MaxAuxStreams: -1
[08/31/2024-06:58:40] [I] BuilderOptimizationLevel: -1
[08/31/2024-06:58:40] [I] Input(s)s format: fp32:CHW
[08/31/2024-06:58:40] [I] Output(s)s format: fp32:CHW
[08/31/2024-06:58:40] [I] Input build shape: input=1x3x640x640+3x3x640x640+3x3x640x640
[08/31/2024-06:58:40] [I] Input calibration shapes: model
[08/31/2024-06:58:40] [I] === System Options ===
[08/31/2024-06:58:40] [I] Device: 0
[08/31/2024-06:58:40] [I] DLACore: 
[08/31/2024-06:58:40] [I] Plugins:
[08/31/2024-06:58:40] [I] setPluginsToSerialize:
[08/31/2024-06:58:40] [I] dynamicPlugins:
[08/31/2024-06:58:40] [I] ignoreParsedPluginLibs: 0
[08/31/2024-06:58:40] [I] 
[08/31/2024-06:58:40] [I] === Inference Options ===
[08/31/2024-06:58:40] [I] Batch: Explicit
[08/31/2024-06:58:40] [I] Input inference shape: input=3x3x640x640
[08/31/2024-06:58:40] [I] Iterations: 10
[08/31/2024-06:58:40] [I] Duration: 3s (+ 200ms warm up)
[08/31/2024-06:58:40] [I] Sleep time: 0ms
[08/31/2024-06:58:40] [I] Idle time: 0ms
[08/31/2024-06:58:40] [I] Inference Streams: 1
[08/31/2024-06:58:40] [I] ExposeDMA: Disabled
[08/31/2024-06:58:40] [I] Data transfers: Enabled
[08/31/2024-06:58:40] [I] Spin-wait: Disabled
[08/31/2024-06:58:40] [I] Multithreading: Disabled
[08/31/2024-06:58:40] [I] CUDA Graph: Disabled
[08/31/2024-06:58:40] [I] Separate profiling: Disabled
[08/31/2024-06:58:40] [I] Time Deserialize: Disabled
[08/31/2024-06:58:40] [I] Time Refit: Disabled
[08/31/2024-06:58:40] [I] NVTX verbosity: 0
[08/31/2024-06:58:40] [I] Persistent Cache Ratio: 0
[08/31/2024-06:58:40] [I] Inputs:
[08/31/2024-06:58:40] [I] === Reporting Options ===
[08/31/2024-06:58:40] [I] Verbose: Disabled
[08/31/2024-06:58:40] [I] Averages: 10 inferences
[08/31/2024-06:58:40] [I] Percentiles: 90,95,99
[08/31/2024-06:58:40] [I] Dump refittable layers:Disabled
[08/31/2024-06:58:40] [I] Dump output: Disabled
[08/31/2024-06:58:40] [I] Profile: Disabled
[08/31/2024-06:58:40] [I] Export timing to JSON file: 
[08/31/2024-06:58:40] [I] Export output to JSON file: 
[08/31/2024-06:58:40] [I] Export profile to JSON file: 
[08/31/2024-06:58:40] [I] 
[08/31/2024-06:58:40] [I] === Device Information ===
[08/31/2024-06:58:40] [I] Selected Device: Orin
[08/31/2024-06:58:40] [I] Compute Capability: 8.7
[08/31/2024-06:58:40] [I] SMs: 16
[08/31/2024-06:58:40] [I] Device Global Memory: 62841 MiB
[08/31/2024-06:58:40] [I] Shared Memory per SM: 164 KiB
[08/31/2024-06:58:40] [I] Memory Bus Width: 256 bits (ECC disabled)
[08/31/2024-06:58:40] [I] Application Compute Clock Rate: 1.3 GHz
[08/31/2024-06:58:40] [I] Application Memory Clock Rate: 1.3 GHz
[08/31/2024-06:58:40] [I] 
[08/31/2024-06:58:40] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[08/31/2024-06:58:40] [I] 
[08/31/2024-06:58:40] [I] TensorRT version: 8.6.2
[08/31/2024-06:58:40] [I] Loading standard plugins
[08/31/2024-06:58:40] [I] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 33, GPU 5011 (MiB)
[08/31/2024-06:58:45] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1154, GPU +1431, now: CPU 1223, GPU 6478 (MiB)
[08/31/2024-06:58:45] [I] Start parsing network model.
[08/31/2024-06:58:45] [I] [TRT] ----------------------------------------------------------------
[08/31/2024-06:58:45] [I] [TRT] Input filename:   /opt/nvidia/deepstream/deepstream-7.0/samples/models/smart_warehouse/quant.onnx
[08/31/2024-06:58:45] [I] [TRT] ONNX IR version:  0.0.10
[08/31/2024-06:58:45] [I] [TRT] Opset version:    16
[08/31/2024-06:58:45] [I] [TRT] Producer name:    onnx.quantize
[08/31/2024-06:58:45] [I] [TRT] Producer version: 0.1.0
[08/31/2024-06:58:45] [I] [TRT] Domain:           
[08/31/2024-06:58:45] [I] [TRT] Model version:    0
[08/31/2024-06:58:45] [I] [TRT] Doc string:       
[08/31/2024-06:58:45] [I] [TRT] ----------------------------------------------------------------
[08/31/2024-06:58:45] [W] [TRT] onnx2trt_utils.cpp:372: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[08/31/2024-06:58:46] [W] [TRT] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped
[08/31/2024-06:58:46] [W] [TRT] Tensor DataType is determined at build time for tensors not marked as input or output.
[08/31/2024-06:58:46] [I] Finished parsing network model. Parse time: 0.237448
[08/31/2024-06:58:46] [W] [TRT] DLA requests all profiles have same min, max, and opt value. All dla layers are falling back to GPU
[08/31/2024-06:58:46] [E] Error[4]: [network.cpp::validate::3040] Error Code 4: Internal Error (Int8 precision has been set for a layer or layer output, but int8 is not configured in the builder)
[08/31/2024-06:58:46] [E] Engine could not be created from network
[08/31/2024-06:58:46] [E] Building engine failed
[08/31/2024-06:58:46] [E] Failed to create engine from model or file.
[08/31/2024-06:58:46] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8602] # trtexec --onnx=/opt/nvidia/deepstream/deepstream-7.0/samples/models/smart_warehouse/quant.onnx --saveEngine=/opt/nvidia/deepstream/deepstream-7.0/samples/models/smart_warehouse/quant_3.engine --minShapes=input:1x3x640x640 --optShapes=input:3x3x640x640 --maxShapes=input:3x3x640x640
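
(For completeness: the Error Code 4 text above complains that INT8 precision is set on layers while "int8 is not configured in the builder", which reads as the INT8 builder flag never being enabled. If that is all it means, the build command would presumably just gain an --int8 flag, roughly as below; we have not confirmed that this is the intended fix:)

trtexec --onnx=/opt/nvidia/deepstream/deepstream-7.0/samples/models/smart_warehouse/quant.onnx \
    --saveEngine=/opt/nvidia/deepstream/deepstream-7.0/samples/models/smart_warehouse/quant_3.engine \
    --minShapes="input":1x3x640x640 --optShapes="input":3x3x640x640 --maxShapes="input":3x3x640x640 \
    --int8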

Hi,

Try optimizing your model with the trtexec (TensorRT command-line wrapper) command.

Thanks for your response.

Could you please spell out the command for optimizing the model using trtexec?
We tried building the model engine using TensorRT, but that didn't work. We searched for documentation detailing the end-to-end quantization of a custom YOLOv8 .onnx model, but all we could find were fragments of information here and there, and the methods we found were either deprecated or did not work in conjunction with trtexec.

As detailed in our earlier response on this post, we referred to the documentation at TensorRT-Model-Optimizer/onnx_ptq at main · NVIDIA/TensorRT-Model-Optimizer · GitHub to build a calibration file and a quantized model, and then used TensorRT to try to build the engine, which gave the error shown earlier in this thread.

Is the process we're following to build the calibration file, the quantized model, and the model engine correct, or is there a better method out there? If not, where in our process are we going wrong?

Hi,

I think there is a problem with parsing your ONNX model. Would you be able to share your model (privately) so that I can test it and get back to you?