Unable to build model engine for INT8 YOLOv8m quantized using TensorRT Model Optimizer

Description

• Hardware Platform (Jetson / GPU) - Jetson Orin AGX 64 GB Developer Kit
• DeepStream Version - Docker Container - deepstream:7.0-triton-multiarch
• JetPack Version (valid for Jetson only) - 6.0
• TensorRT Version - 10.1+
• NVIDIA GPU Driver Version (valid for GPU only) -
• Issue Type (questions, new requirements, bugs) - Question

Hi,

Since implicit quantization has been deprecated after TensorRT 10.1 and there are no end-to-end examples available for explicit quantization, I tried quantizing the model with the TensorRT Model Optimizer library.

Link to the repo: https://github.com/NVIDIA/TensorRT-Model-Optimizer

And I used this snippet to quantize:

python -m modelopt.onnx.quantization \
    --onnx_path=model.onnx \
    --quantize_mode=int8 \
    --calibration_data=calib.npy \
    --calibration_method=minimax \
    --output_path=quant.onnx
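
For reference, calib.npy is just a NumPy array of preprocessed network inputs saved to disk. A minimal sketch of how such a file can be produced is below (the image directory, sample count, and preprocessing details are illustrative placeholders, not our exact pipeline):

import glob

import cv2
import numpy as np

# Collect a few hundred representative frames for calibration.
image_paths = sorted(glob.glob("calib_images/*.jpg"))[:512]

samples = []
for path in image_paths:
    img = cv2.imread(path)                    # BGR, HxWx3, uint8
    img = cv2.resize(img, (640, 640))         # match the 640x640 model input
    img = img[:, :, ::-1].transpose(2, 0, 1)  # BGR -> RGB, HWC -> CHW
    samples.append(img.astype(np.float32) / 255.0)

# Shape (N, 3, 640, 640), matching the model's "input" tensor.
np.save("calib.npy", np.stack(samples))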

But trtexec is unable to build an engine for this INT8 model and throws Error Code 4, stating that the builder could not be configured.

Can someone please let me know whether there are any examples of explicit quantization or of resolving this error, and whether this is the right way to quantize the model?

Hi @ashutoshpanda2002, can you please share detailed logs, the model, and the repro steps/scripts?

Steps we followed to quantize the model: https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/onnx_ptq

Output of the above code was:

/usr/local/lib/python3.10/dist-packages/modelopt/onnx/quantization/int4.py:29: UserWarning: Using slower INT4 ONNX quantization using numpy. Install JAX (Installation — JAX documentation) for faster quantization: No module named 'jax'
warnings.warn(
Loading extension modelopt_round_and_pack_ext…

INFO:root:Model last.onnx with opset_version 16 is loaded.
INFO:root:Quantization Mode: int8
INFO:root:Quantizable op types in the model: ['Add', 'Mul', 'MaxPool', 'Conv']
INFO:root:Building non-residual Add input map …
INFO:root:Searching for hard-coded patterns like MHA, LayerNorm, etc. to avoid quantization.
INFO:root:Building KGEN/CASK targeted partitions …
INFO:root:Classifying the partition nodes …
INFO:root:Total number of nodes: 526
INFO:root:Skipped node count: 0
WARNING:root:Please consider to run pre-processing before quantization. Refer to example: onnxruntime-inference-examples/quantization/image_classification/cpu/ReadMe.md at main · microsoft/onnxruntime-inference-examples · GitHub
/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py:69: UserWarning: Specified provider 'CUDAExecutionProvider' is not in available provider names.Available providers: 'AzureExecutionProvider, CPUExecutionProvider'
warnings.warn(
WARNING:root:Please consider pre-processing before quantization. See onnxruntime-inference-examples/quantization/image_classification/cpu/ReadMe.md at main · microsoft/onnxruntime-inference-examples · GitHub
INFO:root:Deleting QDQ nodes from marked inputs to make certain operations fusible …
INFO:root:Quantized onnx model is saved as quant.onnx
INFO:root:Total number of quantized nodes: 107
INFO:root:Quantized node types: {'Add', 'Concat', 'MaxPool', 'Conv'}
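
(For reference, the pre-processing that the warnings above suggest is ONNX Runtime's model pre-processing pass from the linked ReadMe. Assuming it applies to this model, it would be run on the ONNX file before quantization with a command along these lines; shown here only for context:)

python -m onnxruntime.quantization.preprocess --input last.onnx --output last_preprocessed.onnx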

After this we used trtexec to build the model engine:


root@ubuntu:/usr/src/tensorrt/bin# trtexec --onnx=/opt/nvidia/deepstream/deepstream-7.0/samples/models/smart_warehouse/quant.onnx \
--saveEngine=/opt/nvidia/deepstream/deepstream-7.0/samples/models/smart_warehouse/quant_3.engine \
--minShapes="input":1x3x640x640 --optShapes="input":3x3x640x640 --maxShapes="input":3x3x640x640&
[1] 179
root@ubuntu:/usr/src/tensorrt/bin# &&&& RUNNING TensorRT.trtexec [TensorRT v8602] # trtexec --onnx=/opt/nvidia/deepstream/deepstream-7.0/samples/models/smart_warehouse/quant.onnx --saveEngine=/opt/nvidia/deepstream/deepstream-7.0/samples/models/smart_warehouse/quant_3.engine --minShapes=input:1x3x640x640 --optShapes=input:3x3x640x640 --maxShapes=input:3x3x640x640
[08/31/2024-06:58:40] [I] === Model Options ===
[08/31/2024-06:58:40] [I] Format: ONNX
[08/31/2024-06:58:40] [I] Model: /opt/nvidia/deepstream/deepstream-7.0/samples/models/smart_warehouse/quant.onnx
[08/31/2024-06:58:40] [I] Output:
[08/31/2024-06:58:40] [I] === Build Options ===
[08/31/2024-06:58:40] [I] Max batch: explicit batch
[08/31/2024-06:58:40] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[08/31/2024-06:58:40] [I] minTiming: 1
[08/31/2024-06:58:40] [I] avgTiming: 8
[08/31/2024-06:58:40] [I] Precision: FP32
[08/31/2024-06:58:40] [I] LayerPrecisions: 
[08/31/2024-06:58:40] [I] Layer Device Types: 
[08/31/2024-06:58:40] [I] Calibration: 
[08/31/2024-06:58:40] [I] Refit: Disabled
[08/31/2024-06:58:40] [I] Version Compatible: Disabled
[08/31/2024-06:58:40] [I] ONNX Native InstanceNorm: Disabled
[08/31/2024-06:58:40] [I] TensorRT runtime: full
[08/31/2024-06:58:40] [I] Lean DLL Path: 
[08/31/2024-06:58:40] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[08/31/2024-06:58:40] [I] Exclude Lean Runtime: Disabled
[08/31/2024-06:58:40] [I] Sparsity: Disabled
[08/31/2024-06:58:40] [I] Safe mode: Disabled
[08/31/2024-06:58:40] [I] Build DLA standalone loadable: Disabled
[08/31/2024-06:58:40] [I] Allow GPU fallback for DLA: Disabled
[08/31/2024-06:58:40] [I] DirectIO mode: Disabled
[08/31/2024-06:58:40] [I] Restricted mode: Disabled
[08/31/2024-06:58:40] [I] Skip inference: Disabled
[08/31/2024-06:58:40] [I] Save engine: /opt/nvidia/deepstream/deepstream-7.0/samples/models/smart_warehouse/quant_3.engine
[08/31/2024-06:58:40] [I] Load engine: 
[08/31/2024-06:58:40] [I] Profiling verbosity: 0
[08/31/2024-06:58:40] [I] Tactic sources: Using default tactic sources
[08/31/2024-06:58:40] [I] timingCacheMode: local
[08/31/2024-06:58:40] [I] timingCacheFile: 
[08/31/2024-06:58:40] [I] Heuristic: Disabled
[08/31/2024-06:58:40] [I] Preview Features: Use default preview flags.
[08/31/2024-06:58:40] [I] MaxAuxStreams: -1
[08/31/2024-06:58:40] [I] BuilderOptimizationLevel: -1
[08/31/2024-06:58:40] [I] Input(s)s format: fp32:CHW
[08/31/2024-06:58:40] [I] Output(s)s format: fp32:CHW
[08/31/2024-06:58:40] [I] Input build shape: input=1x3x640x640+3x3x640x640+3x3x640x640
[08/31/2024-06:58:40] [I] Input calibration shapes: model
[08/31/2024-06:58:40] [I] === System Options ===
[08/31/2024-06:58:40] [I] Device: 0
[08/31/2024-06:58:40] [I] DLACore: 
[08/31/2024-06:58:40] [I] Plugins:
[08/31/2024-06:58:40] [I] setPluginsToSerialize:
[08/31/2024-06:58:40] [I] dynamicPlugins:
[08/31/2024-06:58:40] [I] ignoreParsedPluginLibs: 0
[08/31/2024-06:58:40] [I] 
[08/31/2024-06:58:40] [I] === Inference Options ===
[08/31/2024-06:58:40] [I] Batch: Explicit
[08/31/2024-06:58:40] [I] Input inference shape: input=3x3x640x640
[08/31/2024-06:58:40] [I] Iterations: 10
[08/31/2024-06:58:40] [I] Duration: 3s (+ 200ms warm up)
[08/31/2024-06:58:40] [I] Sleep time: 0ms
[08/31/2024-06:58:40] [I] Idle time: 0ms
[08/31/2024-06:58:40] [I] Inference Streams: 1
[08/31/2024-06:58:40] [I] ExposeDMA: Disabled
[08/31/2024-06:58:40] [I] Data transfers: Enabled
[08/31/2024-06:58:40] [I] Spin-wait: Disabled
[08/31/2024-06:58:40] [I] Multithreading: Disabled
[08/31/2024-06:58:40] [I] CUDA Graph: Disabled
[08/31/2024-06:58:40] [I] Separate profiling: Disabled
[08/31/2024-06:58:40] [I] Time Deserialize: Disabled
[08/31/2024-06:58:40] [I] Time Refit: Disabled
[08/31/2024-06:58:40] [I] NVTX verbosity: 0
[08/31/2024-06:58:40] [I] Persistent Cache Ratio: 0
[08/31/2024-06:58:40] [I] Inputs:
[08/31/2024-06:58:40] [I] === Reporting Options ===
[08/31/2024-06:58:40] [I] Verbose: Disabled
[08/31/2024-06:58:40] [I] Averages: 10 inferences
[08/31/2024-06:58:40] [I] Percentiles: 90,95,99
[08/31/2024-06:58:40] [I] Dump refittable layers:Disabled
[08/31/2024-06:58:40] [I] Dump output: Disabled
[08/31/2024-06:58:40] [I] Profile: Disabled
[08/31/2024-06:58:40] [I] Export timing to JSON file: 
[08/31/2024-06:58:40] [I] Export output to JSON file: 
[08/31/2024-06:58:40] [I] Export profile to JSON file: 
[08/31/2024-06:58:40] [I] 
[08/31/2024-06:58:40] [I] === Device Information ===
[08/31/2024-06:58:40] [I] Selected Device: Orin
[08/31/2024-06:58:40] [I] Compute Capability: 8.7
[08/31/2024-06:58:40] [I] SMs: 16
[08/31/2024-06:58:40] [I] Device Global Memory: 62841 MiB
[08/31/2024-06:58:40] [I] Shared Memory per SM: 164 KiB
[08/31/2024-06:58:40] [I] Memory Bus Width: 256 bits (ECC disabled)
[08/31/2024-06:58:40] [I] Application Compute Clock Rate: 1.3 GHz
[08/31/2024-06:58:40] [I] Application Memory Clock Rate: 1.3 GHz
[08/31/2024-06:58:40] [I] 
[08/31/2024-06:58:40] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[08/31/2024-06:58:40] [I] 
[08/31/2024-06:58:40] [I] TensorRT version: 8.6.2
[08/31/2024-06:58:40] [I] Loading standard plugins
[08/31/2024-06:58:40] [I] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 33, GPU 5011 (MiB)
[08/31/2024-06:58:45] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1154, GPU +1431, now: CPU 1223, GPU 6478 (MiB)
[08/31/2024-06:58:45] [I] Start parsing network model.
[08/31/2024-06:58:45] [I] [TRT] ----------------------------------------------------------------
[08/31/2024-06:58:45] [I] [TRT] Input filename:   /opt/nvidia/deepstream/deepstream-7.0/samples/models/smart_warehouse/quant.onnx
[08/31/2024-06:58:45] [I] [TRT] ONNX IR version:  0.0.10
[08/31/2024-06:58:45] [I] [TRT] Opset version:    16
[08/31/2024-06:58:45] [I] [TRT] Producer name:    onnx.quantize
[08/31/2024-06:58:45] [I] [TRT] Producer version: 0.1.0
[08/31/2024-06:58:45] [I] [TRT] Domain:           
[08/31/2024-06:58:45] [I] [TRT] Model version:    0
[08/31/2024-06:58:45] [I] [TRT] Doc string:       
[08/31/2024-06:58:45] [I] [TRT] ----------------------------------------------------------------
[08/31/2024-06:58:45] [W] [TRT] onnx2trt_utils.cpp:372: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[08/31/2024-06:58:46] [W] [TRT] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped
[08/31/2024-06:58:46] [W] [TRT] Tensor DataType is determined at build time for tensors not marked as input or output.
[08/31/2024-06:58:46] [I] Finished parsing network model. Parse time: 0.237448
[08/31/2024-06:58:46] [W] [TRT] DLA requests all profiles have same min, max, and opt value. All dla layers are falling back to GPU
[08/31/2024-06:58:46] [E] Error[4]: [network.cpp::validate::3040] Error Code 4: Internal Error (Int8 precision has been set for a layer or layer output, but int8 is not configured in the builder)
[08/31/2024-06:58:46] [E] Engine could not be created from network
[08/31/2024-06:58:46] [E] Building engine failed
[08/31/2024-06:58:46] [E] Failed to create engine from model or file.
[08/31/2024-06:58:46] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8602] # trtexec --onnx=/opt/nvidia/deepstream/deepstream-7.0/samples/models/smart_warehouse/quant.onnx --saveEngine=/opt/nvidia/deepstream/deepstream-7.0/samples/models/smart_warehouse/quant_3.engine --minShapes=input:1x3x640x640 --optShapes=input:3x3x640x640 --maxShapes=input:3x3x640x640
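
(For completeness: the Error Code 4 text above complains that INT8 precision is set on layers while "int8 is not configured in the builder", which reads as the INT8 builder flag never being enabled. If that is all it means, the build command would presumably just gain an --int8 flag, roughly as below; we have not confirmed that this is the intended fix:)

trtexec --onnx=/opt/nvidia/deepstream/deepstream-7.0/samples/models/smart_warehouse/quant.onnx \
    --saveEngine=/opt/nvidia/deepstream/deepstream-7.0/samples/models/smart_warehouse/quant_3.engine \
    --minShapes="input":1x3x640x640 --optShapes="input":3x3x640x640 --maxShapes="input":3x3x640x640 \
    --int8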

Hi,

Try optimizing your model with the trtexec (TensorRT command-line wrapper) command.

Thanks for your response.

Could you please spell out the command for optimizing the model using trtexec?
We tried building the model engine using TensorRT, but that didn't work. We searched for documentation detailing the end-to-end quantization of a custom YOLOv8 .onnx model, but all we could find were fragments of information here and there, and the methods we found were either deprecated or did not work in conjunction with trtexec.

As detailed in our earlier response on this post, we referred to the documentation at TensorRT-Model-Optimizer/onnx_ptq at main · NVIDIA/TensorRT-Model-Optimizer · GitHub to build a calibration file and a quantized model, and then used TensorRT to try to build the engine, which gave the error shown earlier in this thread.

Is the process we're following to build the calibration file, the quantized model, and the model engine correct, or is there a better method out there? If not, where in our process are we going wrong?

Hi,

I think there is a problem with parsing your ONNX model. Would you be able to share your model (privately) so that I can test it and get back to you?