TensorRT quantization bug on JetPack 6.0

Here is the environment:

Software part of jetson-stats 4.2.4 - (c) 2024, Raffaello Bonghi
Model: NVIDIA Jetson AGX Orin Developer Kit - Jetpack 6.0 DP [L4T 36.2.0]
NV Power Mode[2]: MODE_30W
Serial Number: [XXX Show with: jetson_release -s XXX]
Hardware:
 - 699-level Part Number: 699-13701-0005-500 M.0
 - P-Number: p3701-0005
 - Module: NVIDIA Jetson AGX Orin (64GB ram)
 - SoC: tegra234
 - CUDA Arch BIN: 8.7
Platform:
 - Machine: aarch64
 - System: Linux
 - Distribution: Ubuntu 22.04 Jammy Jellyfish
 - Release: 5.15.122-tegra
 - Python: 3.10.12
jtop:
 - Version: 4.2.4
 - Service: Active
Libraries:
 - CUDA: 12.2.140
 - cuDNN: 8.9.4.25
 - TensorRT: 8.6.2.3
 - VPI: 3.0.10
 - Vulkan: 1.3.204
 - OpenCV: 4.8.0 - with CUDA: NO

The model I am working on is a segmentation model with a transformer module.
I first ran quantization-aware training (QAT) on the model and then exported it to ONNX; a minimal sketch of the export step is shown below.
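
For context, this is roughly how the ONNX model was produced. It is only a sketch, assuming NVIDIA's pytorch-quantization toolkit for the QAT part; the model factory, checkpoint path, and input shape below are placeholders rather than my exact training code:

import torch
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

# Swap torch.nn layers for quantized equivalents that carry fake-quant
# (Q/DQ) nodes, then build and load the QAT-trained model as usual.
quant_modules.initialize()
model = build_segmentation_model()  # hypothetical model factory
model.load_state_dict(torch.load("qat_checkpoint.pth"))
model.eval()

# Make the fake-quant nodes export as ONNX QuantizeLinear/DequantizeLinear.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy = torch.randn(1, 3, 544, 544)  # assumed input size (544 from the file name)
torch.onnx.export(model, dummy, "qat_544_V5.0.onnx", opset_version=17)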
When I convert the exported ONNX model to a TensorRT engine with the command

/usr/src/tensorrt/bin/trtexec --onnx=/home/orin-2/yue/vision_transformer_optimization/export/qat_544_V5.0.onnx --int8 --fp16 --profilingVerbosity=detailed

I get the following error log:

&&&& RUNNING TensorRT.trtexec [TensorRT v8602] # /usr/src/tensorrt/bin/trtexec --onnx=/home/orin-2/yue/vision_transformer_optimization/export/qat_544_V5.0.onnx --int8 --fp16 --profilingVerbosity=detailed
[01/17/2024-15:05:14] [I] === Model Options ===
[01/17/2024-15:05:14] [I] Format: ONNX
[01/17/2024-15:05:14] [I] Model: /home/orin-2/yue/vision_transformer_optimization/export/qat_544_V5.0.onnx
[01/17/2024-15:05:14] [I] Output:
[01/17/2024-15:05:14] [I] === Build Options ===
[01/17/2024-15:05:14] [I] Max batch: explicit batch
[01/17/2024-15:05:14] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[01/17/2024-15:05:14] [I] minTiming: 1
[01/17/2024-15:05:14] [I] avgTiming: 8
[01/17/2024-15:05:14] [I] Precision: FP32+FP16+INT8
[01/17/2024-15:05:14] [I] LayerPrecisions: 
[01/17/2024-15:05:14] [I] Layer Device Types: 
[01/17/2024-15:05:14] [I] Calibration: Dynamic
[01/17/2024-15:05:14] [I] Refit: Disabled
[01/17/2024-15:05:14] [I] Version Compatible: Disabled
[01/17/2024-15:05:14] [I] ONNX Native InstanceNorm: Disabled
[01/17/2024-15:05:14] [I] TensorRT runtime: full
[01/17/2024-15:05:14] [I] Lean DLL Path: 
[01/17/2024-15:05:14] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[01/17/2024-15:05:14] [I] Exclude Lean Runtime: Disabled
[01/17/2024-15:05:14] [I] Sparsity: Disabled
[01/17/2024-15:05:14] [I] Safe mode: Disabled
[01/17/2024-15:05:14] [I] Build DLA standalone loadable: Disabled
[01/17/2024-15:05:14] [I] Allow GPU fallback for DLA: Disabled
[01/17/2024-15:05:14] [I] DirectIO mode: Disabled
[01/17/2024-15:05:14] [I] Restricted mode: Disabled
[01/17/2024-15:05:14] [I] Skip inference: Disabled
[01/17/2024-15:05:14] [I] Save engine: 
[01/17/2024-15:05:14] [I] Load engine: 
[01/17/2024-15:05:14] [I] Profiling verbosity: 2
[01/17/2024-15:05:14] [I] Tactic sources: Using default tactic sources
[01/17/2024-15:05:14] [I] timingCacheMode: local
[01/17/2024-15:05:14] [I] timingCacheFile: 
[01/17/2024-15:05:14] [I] Heuristic: Disabled
[01/17/2024-15:05:14] [I] Preview Features: Use default preview flags.
[01/17/2024-15:05:14] [I] MaxAuxStreams: -1
[01/17/2024-15:05:14] [I] BuilderOptimizationLevel: -1
[01/17/2024-15:05:14] [I] Input(s)s format: fp32:CHW
[01/17/2024-15:05:14] [I] Output(s)s format: fp32:CHW
[01/17/2024-15:05:14] [I] Input build shapes: model
[01/17/2024-15:05:14] [I] Input calibration shapes: model
[01/17/2024-15:05:14] [I] === System Options ===
[01/17/2024-15:05:14] [I] Device: 0
[01/17/2024-15:05:14] [I] DLACore: 
[01/17/2024-15:05:14] [I] Plugins:
[01/17/2024-15:05:14] [I] setPluginsToSerialize:
[01/17/2024-15:05:14] [I] dynamicPlugins:
[01/17/2024-15:05:14] [I] ignoreParsedPluginLibs: 0
[01/17/2024-15:05:14] [I] 
[01/17/2024-15:05:14] [I] === Inference Options ===
[01/17/2024-15:05:14] [I] Batch: Explicit
[01/17/2024-15:05:14] [I] Input inference shapes: model
[01/17/2024-15:05:14] [I] Iterations: 10
[01/17/2024-15:05:14] [I] Duration: 3s (+ 200ms warm up)
[01/17/2024-15:05:14] [I] Sleep time: 0ms
[01/17/2024-15:05:14] [I] Idle time: 0ms
[01/17/2024-15:05:14] [I] Inference Streams: 1
[01/17/2024-15:05:14] [I] ExposeDMA: Disabled
[01/17/2024-15:05:14] [I] Data transfers: Enabled
[01/17/2024-15:05:14] [I] Spin-wait: Disabled
[01/17/2024-15:05:14] [I] Multithreading: Disabled
[01/17/2024-15:05:14] [I] CUDA Graph: Disabled
[01/17/2024-15:05:14] [I] Separate profiling: Disabled
[01/17/2024-15:05:14] [I] Time Deserialize: Disabled
[01/17/2024-15:05:14] [I] Time Refit: Disabled
[01/17/2024-15:05:14] [I] NVTX verbosity: 2
[01/17/2024-15:05:14] [I] Persistent Cache Ratio: 0
[01/17/2024-15:05:14] [I] Inputs:
[01/17/2024-15:05:14] [I] === Reporting Options ===
[01/17/2024-15:05:14] [I] Verbose: Disabled
[01/17/2024-15:05:14] [I] Averages: 10 inferences
[01/17/2024-15:05:14] [I] Percentiles: 90,95,99
[01/17/2024-15:05:14] [I] Dump refittable layers:Disabled
[01/17/2024-15:05:14] [I] Dump output: Disabled
[01/17/2024-15:05:14] [I] Profile: Disabled
[01/17/2024-15:05:14] [I] Export timing to JSON file: 
[01/17/2024-15:05:14] [I] Export output to JSON file: 
[01/17/2024-15:05:14] [I] Export profile to JSON file: 
[01/17/2024-15:05:14] [I] 
[01/17/2024-15:05:14] [I] === Device Information ===
[01/17/2024-15:05:14] [I] Selected Device: Orin
[01/17/2024-15:05:14] [I] Compute Capability: 8.7
[01/17/2024-15:05:14] [I] SMs: 8
[01/17/2024-15:05:14] [I] Device Global Memory: 62841 MiB
[01/17/2024-15:05:14] [I] Shared Memory per SM: 164 KiB
[01/17/2024-15:05:14] [I] Memory Bus Width: 256 bits (ECC disabled)
[01/17/2024-15:05:14] [I] Application Compute Clock Rate: 1.3 GHz
[01/17/2024-15:05:14] [I] Application Memory Clock Rate: 0.612 GHz
[01/17/2024-15:05:14] [I] 
[01/17/2024-15:05:14] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[01/17/2024-15:05:14] [I] 
[01/17/2024-15:05:14] [I] TensorRT version: 8.6.2
[01/17/2024-15:05:14] [I] Loading standard plugins
[01/17/2024-15:05:14] [I] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 33, GPU 12702 (MiB)
[01/17/2024-15:05:20] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1154, GPU +1114, now: CPU 1223, GPU 13851 (MiB)
[01/17/2024-15:05:20] [I] Start parsing network model.
[01/17/2024-15:05:20] [I] [TRT] ----------------------------------------------------------------
[01/17/2024-15:05:20] [I] [TRT] Input filename:   /home/orin-2/yue/vision_transformer_optimization/export/qat_544_V5.0.onnx
[01/17/2024-15:05:20] [I] [TRT] ONNX IR version:  0.0.8
[01/17/2024-15:05:20] [I] [TRT] Opset version:    17
[01/17/2024-15:05:20] [I] [TRT] Producer name:    pytorch
[01/17/2024-15:05:20] [I] [TRT] Producer version: 2.2.0
[01/17/2024-15:05:20] [I] [TRT] Domain:           
[01/17/2024-15:05:20] [I] [TRT] Model version:    0
[01/17/2024-15:05:20] [I] [TRT] Doc string:       
[01/17/2024-15:05:20] [I] [TRT] ----------------------------------------------------------------
[01/17/2024-15:05:20] [W] [TRT] onnx2trt_utils.cpp:372: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[01/17/2024-15:05:20] [W] [TRT] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped
[01/17/2024-15:05:20] [I] Finished parsing network model. Parse time: 0.342161
[01/17/2024-15:05:20] [W] [TRT] Calibrator won't be used in explicit precision mode. Use quantization aware training to generate network with Quantize/Dequantize nodes.
[01/17/2024-15:05:21] [I] [TRT] Graph optimization time: 0.64682 seconds.
[01/17/2024-15:05:21] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[01/17/2024-15:05:21] [E] Error[10]: Could not find any implementation for node [trainStation1].
[01/17/2024-15:05:21] [E] Error[10]: [optimizer.cpp::computeCosts::3869] Error Code 10: Internal Error (Could not find any implementation for node [trainStation1].)
[01/17/2024-15:05:21] [E] Engine could not be created from network
[01/17/2024-15:05:21] [E] Building engine failed
[01/17/2024-15:05:21] [E] Failed to create engine from model or file.
[01/17/2024-15:05:21] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8602] # /usr/src/tensorrt/bin/trtexec --onnx=/home/orin-2/yue/vision_transformer_optimization/export/qat_544_V5.0.onnx --int8 --fp16 --profilingVerbosity=detailed

I have no idea why TensorRT cannot find an implementation for its internally generated node trainStation1. Could you help with this? Thanks!
By the way, I also ran the same TensorRT conversion with the same ONNX model on a discrete-GPU machine (also TensorRT 8.6), and everything worked fine, so the problem does not appear to be in the ONNX model itself.
I have also uploaded the ONNX model so you can reproduce the issue.
qat_544_V5.0.zip (81.8 MB)
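
In case it helps with triage, here is a small sketch of how the Q/DQ structure of the attached model can be checked with the onnx Python package. trtexec warns that the calibrator is skipped in explicit-precision mode, so the QuantizeLinear/DequantizeLinear pairs carry the quantization; trainStation1 itself does not come from my model, so this only confirms the QAT graph looks as expected:

import onnx

# Verify the QAT export contains explicit QuantizeLinear/DequantizeLinear
# pairs, which TensorRT uses in place of a calibrator.
model = onnx.load("qat_544_V5.0.onnx")
qdq = [n for n in model.graph.node
       if n.op_type in ("QuantizeLinear", "DequantizeLinear")]
print(f"{len(qdq)} Q/DQ nodes out of {len(model.graph.node)} total")
for node in qdq[:10]:  # spot-check the first few
    print(node.op_type, node.name)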

Hi,

This is a known issue on JetPack 6.0 DP.

Our internal team is working on this.
We will share more information with you later.

Thanks.

Thank you!

Hi,

We have confirmed that this issue is fixed in our internal JetPack 6 GA release.
You can find the expected timing in our software release roadmap.
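
Once GA is available, a quick way to confirm which TensorRT build is installed (assuming the Python bindings are present):

import tensorrt as trt

# Print the TensorRT version bundled with the installed JetPack.
print(trt.__version__)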

Thanks.

Thanks for letting me know. Is there a workaround, or do we have to wait until March?

Hi,

Unfortunately, we don't have a known workaround (WAR) right now.
We will let you know if we find a way to avoid this issue.

Thanks.
