Please provide the following info (tick the boxes after creating this topic):
Software Version
DRIVE OS 6.0.10.0
DRIVE OS 6.0.8.1
DRIVE OS 6.0.6
DRIVE OS 6.0.5
DRIVE OS 6.0.4 (rev. 1)
DRIVE OS 6.0.4 SDK
other
Target Operating System
Linux
QNX
other
Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-300)
DRIVE AGX Orin Developer Kit (940-63710-0010-200)
DRIVE AGX Orin Developer Kit (940-63710-0010-100)
DRIVE AGX Orin Developer Kit (940-63710-0010-D00)
DRIVE AGX Orin Developer Kit (940-63710-0010-C00)
DRIVE AGX Orin Developer Kit (not sure its number)
other
SDK Manager Version
2.1.0
other
Host Machine Version
native Ubuntu Linux 20.04 Host installed with SDK Manager
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
native Ubuntu Linux 18.04 Host installed with DRIVE OS Docker Containers
other
Issue Description
I encountered a severe performance issue when converting a GPT-like model (with `attention_mask` input) from ONNX to TensorRT.
After successful conversion and deployment on **DriveOS (AGX Orin, TensorRT 8.6.15)**, I used **Nsight Systems (nsys)** to analyze runtime performance. Surprisingly, more than **90% of total inference time** is spent on `attention_mask`-related operations, while the actual matrix multiplications and other Transformer layers take very little time.
This behavior seems abnormal; I suspect the mask computation is not being efficiently fused or optimized by TensorRT.
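For context, the mask path in the model corresponds to roughly the following expansion (paraphrased; the helper name and dtype handling below are my own illustration, not the exact source code), which traces into a chain of small elementwise/broadcast nodes in the ONNX graph:

```python
import torch

def expand_attention_mask(attention_mask: torch.Tensor, dtype=torch.float16) -> torch.Tensor:
    # [batch, seq] 0/1 padding mask -> [batch, 1, 1, seq] additive bias.
    # Traced to ONNX, this becomes Unsqueeze/Cast/Sub/Mul (or Where) glue nodes
    # whose result is broadcast into every attention layer.
    mask = attention_mask[:, None, None, :].to(dtype)
    return (1.0 - mask) * torch.finfo(dtype).min
```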
Environment
- Platform: NVIDIA Drive AGX Orin
- DriveOS Version: 6.12.1
- TensorRT Version: 8.6.15
- CUDA Version: 11.4
- Model type: GPT-like transformer (causal decoder)
- Input: token embeddings + attention_mask + past_kv_cache
- Precision: FP16
- Batch size: 1
- ONNX opset: 16
Steps to Reproduce
- Export model from PyTorch → ONNX (with attention_mask input).
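For reference, the export call looks roughly like this (a minimal, self-contained sketch: `TinyDecoder`, the shapes, and the input/output names are placeholders so the snippet runs, not our actual model):

```python
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    """Stand-in for the real GPT-like decoder, only here to make the export runnable."""
    def __init__(self, hidden=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, x, attention_mask):
        # Turn the 0/1 padding mask into the boolean key_padding_mask MHA expects.
        key_padding_mask = attention_mask == 0
        out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        return out

model = TinyDecoder().eval()
x = torch.randn(1, 128, 256)
attention_mask = torch.ones(1, 128, dtype=torch.int64)

torch.onnx.export(
    model, (x, attention_mask), "test.onnx",
    input_names=["input_embeds", "attention_mask"],
    output_names=["hidden_states"],
    opset_version=16,
    dynamic_axes={"input_embeds": {1: "seq"}, "attention_mask": {1: "seq"}},
)
```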
- Convert ONNX to TensorRT engine:
/usr/src/tensorrt/bin/trtexec --onnx=test.onnx --saveEngine=test.engine \
  --exportLayerInfo=test_layerinfo.log --profilingVerbosity=detailed \
  --exportProfile=test_profile.log --separateProfileRun \
  --duration=5 --streams=1 --useCudaGraph --fp16 --verbose
- Run inference and profile using Nsight Systems:
nsys profile --force-overwrite true -o test --trace=cuda,nvtx,cublas,cudnn --stats=true \
/usr/src/tensorrt/bin/trtexec \
--loadEngine=test.engine \
--iterations=10 --idleTime=500 --duration=0 --useSpinWait
- Observe the runtime results: `attention_mask`-related kernels dominate total time (>90%).
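To double-check this outside of nsys, the per-layer timings in the `--exportProfile` output can be ranked with a small script like the one below (a sketch; it assumes the usual trtexec JSON profile format with `name`/`averageMs` fields, which may differ slightly across TensorRT versions):

```python
import json

# Sketch: rank layers in the trtexec --exportProfile output by average time.
with open("test_profile.log") as f:
    records = json.load(f)

layers = [r for r in records if "name" in r and "averageMs" in r]
layers.sort(key=lambda r: r["averageMs"], reverse=True)
for r in layers[:20]:
    print(f"{r['averageMs']:8.4f} ms  {r['name']}")
```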
Expected Behavior
`attention_mask` operations should be lightweight (simple broadcast or add ops) and should not dominate runtime.
Instead, because the `attention_mask` path generates many intermediate (glue) operators, it appears to be one of the main bottlenecks during TensorRT inference.
Attachments
I’ve attached:
- ONNX model (`test.onnx`)
- TensorRT conversion log (`test.txt`)
- Nsight Systems trace (`test.nsys-rep`)

Uploaded as: `attention_mask_perf_issue.zip`
Questions
- Is this a known issue or regression in TRT 8.6.15 on DriveOS?
- Are there recommended practices to optimize or fuse `attention_mask` handling (e.g., using a plugin or a model rewrite)?
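To make the second question concrete, the kind of model rewrite I mean is something like the following (a sketch only, assuming PyTorch 2.x with `F.scaled_dot_product_attention`; whether TensorRT 8.6 then fuses the exported attention into its fused MHA kernels is exactly what I am unsure about):

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # Pure causal decoding: let SDPA build the causal mask internally, so no
    # explicit attention_mask tensor (and none of its expand/sub/mul glue)
    # appears in the exported ONNX graph.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

def padded_attention(q, k, v, additive_bias):
    # Padding masks: precompute the [batch, 1, 1, seq] additive bias once per
    # forward pass and pass it to every layer, instead of re-deriving it from
    # attention_mask inside each attention block.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=additive_bias)
```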
Additional Notes
Additionally, the attached test.onnx is only a part of the full model. I also tried removing attention_mask and instead making past_key_values a dynamic input, but even after conversion the latency was still not ideal because of the dynamic input.
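For that dynamic-input variant, one thing I can still try is pinning explicit min/opt/max shapes for the kv-cache input via an optimization profile, roughly as below (a sketch; the input name `past_key_values` and the shapes are placeholders for the real model, and the same constraints can also be passed to trtexec via `--minShapes/--optShapes/--maxShapes`):

```python
import tensorrt as trt

# Sketch: build with an explicit optimization profile for the dynamic kv-cache input.
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("test.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

profile = builder.create_optimization_profile()
# Pin min/opt/max along the growing sequence axis so the builder can pick
# specialized tactics instead of fully generic ones.
profile.set_shape("past_key_values",
                  min=(1, 32, 1, 64), opt=(1, 32, 512, 64), max=(1, 32, 1024, 64))
config.add_optimization_profile(profile)

engine = builder.build_serialized_network(network, config)
with open("test_dynamic.engine", "wb") as f:
    f.write(engine)
```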
Would really appreciate any feedback or workaround — this issue currently blocks our deployment. Thanks a lot for your time and support!
ONNX, conversion log, and nsys trace download link:
attention_mask_perf_issue.zip
