TensorRT quantization bug on JetPack 6.0

Here is the environment:

Software part of jetson-stats 4.2.4 - (c) 2024, Raffaello Bonghi
Model: NVIDIA Jetson AGX Orin Developer Kit - Jetpack 6.0 DP [L4T 36.2.0]
NV Power Mode[2]: MODE_30W
Serial Number: [XXX Show with: jetson_release -s XXX]
Hardware:
 - 699-level Part Number: 699-13701-0005-500 M.0
 - P-Number: p3701-0005
 - Module: NVIDIA Jetson AGX Orin (64GB ram)
 - SoC: tegra234
 - CUDA Arch BIN: 8.7
Platform:
 - Machine: aarch64
 - System: Linux
 - Distribution: Ubuntu 22.04 Jammy Jellyfish
 - Release: 5.15.122-tegra
 - Python: 3.10.12
jtop:
 - Version: 4.2.4
 - Service: Active
Libraries:
 - CUDA: 12.2.140
 - cuDNN: 8.9.4.25
 - TensorRT: 8.6.2.3
 - VPI: 3.0.10
 - Vulkan: 1.3.204
 - OpenCV: 4.8.0 - with CUDA: NO

The model I am working on is a segmentation model with a transformer module.
I first ran quantization-aware training (QAT) on the model and then exported it to ONNX; a minimal sketch of the export step is shown below.
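
For context, this is roughly how the ONNX model was produced. It is only a sketch, assuming NVIDIA's pytorch-quantization toolkit for the QAT part; the model factory, checkpoint path, and input shape below are placeholders rather than my exact training code:

import torch
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

# Swap torch.nn layers for quantized equivalents that carry fake-quant
# (Q/DQ) nodes, then build and load the QAT-trained model as usual.
quant_modules.initialize()
model = build_segmentation_model()  # hypothetical model factory
model.load_state_dict(torch.load("qat_checkpoint.pth"))
model.eval()

# Make the fake-quant nodes export as ONNX QuantizeLinear/DequantizeLinear.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy = torch.randn(1, 3, 544, 544)  # assumed input size (544 from the file name)
torch.onnx.export(model, dummy, "qat_544_V5.0.onnx", opset_version=17)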
When I convert the exported ONNX model to a TensorRT engine with the command

/usr/src/tensorrt/bin/trtexec --onnx=/home/orin-2/yue/vision_transformer_optimization/export/qat_544_V5.0.onnx --int8 --fp16 --profilingVerbosity=detailed

I get the following error log:

&&&& RUNNING TensorRT.trtexec [TensorRT v8602] # /usr/src/tensorrt/bin/trtexec --onnx=/home/orin-2/yue/vision_transformer_optimization/export/qat_544_V5.0.onnx --int8 --fp16 --profilingVerbosity=detailed
[01/17/2024-15:05:14] [I] === Model Options ===
[01/17/2024-15:05:14] [I] Format: ONNX
[01/17/2024-15:05:14] [I] Model: /home/orin-2/yue/vision_transformer_optimization/export/qat_544_V5.0.onnx
[01/17/2024-15:05:14] [I] Output:
[01/17/2024-15:05:14] [I] === Build Options ===
[01/17/2024-15:05:14] [I] Max batch: explicit batch
[01/17/2024-15:05:14] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[01/17/2024-15:05:14] [I] minTiming: 1
[01/17/2024-15:05:14] [I] avgTiming: 8
[01/17/2024-15:05:14] [I] Precision: FP32+FP16+INT8
[01/17/2024-15:05:14] [I] LayerPrecisions: 
[01/17/2024-15:05:14] [I] Layer Device Types: 
[01/17/2024-15:05:14] [I] Calibration: Dynamic
[01/17/2024-15:05:14] [I] Refit: Disabled
[01/17/2024-15:05:14] [I] Version Compatible: Disabled
[01/17/2024-15:05:14] [I] ONNX Native InstanceNorm: Disabled
[01/17/2024-15:05:14] [I] TensorRT runtime: full
[01/17/2024-15:05:14] [I] Lean DLL Path: 
[01/17/2024-15:05:14] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[01/17/2024-15:05:14] [I] Exclude Lean Runtime: Disabled
[01/17/2024-15:05:14] [I] Sparsity: Disabled
[01/17/2024-15:05:14] [I] Safe mode: Disabled
[01/17/2024-15:05:14] [I] Build DLA standalone loadable: Disabled
[01/17/2024-15:05:14] [I] Allow GPU fallback for DLA: Disabled
[01/17/2024-15:05:14] [I] DirectIO mode: Disabled
[01/17/2024-15:05:14] [I] Restricted mode: Disabled
[01/17/2024-15:05:14] [I] Skip inference: Disabled
[01/17/2024-15:05:14] [I] Save engine: 
[01/17/2024-15:05:14] [I] Load engine: 
[01/17/2024-15:05:14] [I] Profiling verbosity: 2
[01/17/2024-15:05:14] [I] Tactic sources: Using default tactic sources
[01/17/2024-15:05:14] [I] timingCacheMode: local
[01/17/2024-15:05:14] [I] timingCacheFile: 
[01/17/2024-15:05:14] [I] Heuristic: Disabled
[01/17/2024-15:05:14] [I] Preview Features: Use default preview flags.
[01/17/2024-15:05:14] [I] MaxAuxStreams: -1
[01/17/2024-15:05:14] [I] BuilderOptimizationLevel: -1
[01/17/2024-15:05:14] [I] Input(s)s format: fp32:CHW
[01/17/2024-15:05:14] [I] Output(s)s format: fp32:CHW
[01/17/2024-15:05:14] [I] Input build shapes: model
[01/17/2024-15:05:14] [I] Input calibration shapes: model
[01/17/2024-15:05:14] [I] === System Options ===
[01/17/2024-15:05:14] [I] Device: 0
[01/17/2024-15:05:14] [I] DLACore: 
[01/17/2024-15:05:14] [I] Plugins:
[01/17/2024-15:05:14] [I] setPluginsToSerialize:
[01/17/2024-15:05:14] [I] dynamicPlugins:
[01/17/2024-15:05:14] [I] ignoreParsedPluginLibs: 0
[01/17/2024-15:05:14] [I] 
[01/17/2024-15:05:14] [I] === Inference Options ===
[01/17/2024-15:05:14] [I] Batch: Explicit
[01/17/2024-15:05:14] [I] Input inference shapes: model
[01/17/2024-15:05:14] [I] Iterations: 10
[01/17/2024-15:05:14] [I] Duration: 3s (+ 200ms warm up)
[01/17/2024-15:05:14] [I] Sleep time: 0ms
[01/17/2024-15:05:14] [I] Idle time: 0ms
[01/17/2024-15:05:14] [I] Inference Streams: 1
[01/17/2024-15:05:14] [I] ExposeDMA: Disabled
[01/17/2024-15:05:14] [I] Data transfers: Enabled
[01/17/2024-15:05:14] [I] Spin-wait: Disabled
[01/17/2024-15:05:14] [I] Multithreading: Disabled
[01/17/2024-15:05:14] [I] CUDA Graph: Disabled
[01/17/2024-15:05:14] [I] Separate profiling: Disabled
[01/17/2024-15:05:14] [I] Time Deserialize: Disabled
[01/17/2024-15:05:14] [I] Time Refit: Disabled
[01/17/2024-15:05:14] [I] NVTX verbosity: 2
[01/17/2024-15:05:14] [I] Persistent Cache Ratio: 0
[01/17/2024-15:05:14] [I] Inputs:
[01/17/2024-15:05:14] [I] === Reporting Options ===
[01/17/2024-15:05:14] [I] Verbose: Disabled
[01/17/2024-15:05:14] [I] Averages: 10 inferences
[01/17/2024-15:05:14] [I] Percentiles: 90,95,99
[01/17/2024-15:05:14] [I] Dump refittable layers:Disabled
[01/17/2024-15:05:14] [I] Dump output: Disabled
[01/17/2024-15:05:14] [I] Profile: Disabled
[01/17/2024-15:05:14] [I] Export timing to JSON file: 
[01/17/2024-15:05:14] [I] Export output to JSON file: 
[01/17/2024-15:05:14] [I] Export profile to JSON file: 
[01/17/2024-15:05:14] [I] 
[01/17/2024-15:05:14] [I] === Device Information ===
[01/17/2024-15:05:14] [I] Selected Device: Orin
[01/17/2024-15:05:14] [I] Compute Capability: 8.7
[01/17/2024-15:05:14] [I] SMs: 8
[01/17/2024-15:05:14] [I] Device Global Memory: 62841 MiB
[01/17/2024-15:05:14] [I] Shared Memory per SM: 164 KiB
[01/17/2024-15:05:14] [I] Memory Bus Width: 256 bits (ECC disabled)
[01/17/2024-15:05:14] [I] Application Compute Clock Rate: 1.3 GHz
[01/17/2024-15:05:14] [I] Application Memory Clock Rate: 0.612 GHz
[01/17/2024-15:05:14] [I] 
[01/17/2024-15:05:14] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[01/17/2024-15:05:14] [I] 
[01/17/2024-15:05:14] [I] TensorRT version: 8.6.2
[01/17/2024-15:05:14] [I] Loading standard plugins
[01/17/2024-15:05:14] [I] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 33, GPU 12702 (MiB)
[01/17/2024-15:05:20] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1154, GPU +1114, now: CPU 1223, GPU 13851 (MiB)
[01/17/2024-15:05:20] [I] Start parsing network model.
[01/17/2024-15:05:20] [I] [TRT] ----------------------------------------------------------------
[01/17/2024-15:05:20] [I] [TRT] Input filename:   /home/orin-2/yue/vision_transformer_optimization/export/qat_544_V5.0.onnx
[01/17/2024-15:05:20] [I] [TRT] ONNX IR version:  0.0.8
[01/17/2024-15:05:20] [I] [TRT] Opset version:    17
[01/17/2024-15:05:20] [I] [TRT] Producer name:    pytorch
[01/17/2024-15:05:20] [I] [TRT] Producer version: 2.2.0
[01/17/2024-15:05:20] [I] [TRT] Domain:           
[01/17/2024-15:05:20] [I] [TRT] Model version:    0
[01/17/2024-15:05:20] [I] [TRT] Doc string:       
[01/17/2024-15:05:20] [I] [TRT] ----------------------------------------------------------------
[01/17/2024-15:05:20] [W] [TRT] onnx2trt_utils.cpp:372: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[01/17/2024-15:05:20] [W] [TRT] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped
[01/17/2024-15:05:20] [I] Finished parsing network model. Parse time: 0.342161
[01/17/2024-15:05:20] [W] [TRT] Calibrator won't be used in explicit precision mode. Use quantization aware training to generate network with Quantize/Dequantize nodes.
[01/17/2024-15:05:21] [I] [TRT] Graph optimization time: 0.64682 seconds.
[01/17/2024-15:05:21] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[01/17/2024-15:05:21] [E] Error[10]: Could not find any implementation for node [trainStation1].
[01/17/2024-15:05:21] [E] Error[10]: [optimizer.cpp::computeCosts::3869] Error Code 10: Internal Error (Could not find any implementation for node [trainStation1].)
[01/17/2024-15:05:21] [E] Engine could not be created from network
[01/17/2024-15:05:21] [E] Building engine failed
[01/17/2024-15:05:21] [E] Failed to create engine from model or file.
[01/17/2024-15:05:21] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8602] # /usr/src/tensorrt/bin/trtexec --onnx=/home/orin-2/yue/vision_transformer_optimization/export/qat_544_V5.0.onnx --int8 --fp16 --profilingVerbosity=detailed

I have no idea why TensorRT cannot find an implementation for its internally generated node trainStation1. Could you help with this? Thanks!
By the way, I also ran the same TensorRT conversion with the same ONNX model on a discrete-GPU machine (also TensorRT 8.6), and everything worked fine, so the problem does not appear to be in the ONNX model itself.
I have also uploaded the ONNX model so you can reproduce the issue.
qat_544_V5.0.zip (81.8 MB)
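
In case it helps with triage, here is a small sketch of how the Q/DQ structure of the attached model can be checked with the onnx Python package. trtexec warns that the calibrator is skipped in explicit-precision mode, so the QuantizeLinear/DequantizeLinear pairs carry the quantization; trainStation1 itself does not come from my model, so this only confirms the QAT graph looks as expected:

import onnx

# Verify the QAT export contains explicit QuantizeLinear/DequantizeLinear
# pairs, which TensorRT uses in place of a calibrator.
model = onnx.load("qat_544_V5.0.onnx")
qdq = [n for n in model.graph.node
       if n.op_type in ("QuantizeLinear", "DequantizeLinear")]
print(f"{len(qdq)} Q/DQ nodes out of {len(model.graph.node)} total")
for node in qdq[:10]:  # spot-check the first few
    print(node.op_type, node.name)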

Hi,

This is a known issue on JetPack 6.0 DP.

Our internal team is working on this.
We will share more information with you later.

Thanks.

Thank you!

Hi,

We have confirmed that this issue is fixed in our internal JetPack 6 GA release.
You can find the expected timing in our software release roadmap.
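
Once GA is available, a quick way to confirm which TensorRT build is installed (assuming the Python bindings are present):

import tensorrt as trt

# Print the TensorRT version bundled with the installed JetPack.
print(trt.__version__)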

Thanks.

Thanks for letting me know. Is there a workaround, or do we have to wait until March?

Hi,

Unfortunately, we don't have a known workaround (WAR) right now.
We will let you know if we find a way to avoid this issue.

Thanks.
