Excessive inference time spent on attention_mask operations after ONNX→TensorRT conversion (DriveOS AGX Orin + TRT 8.6.15)

Please provide the following info (tick the boxes after creating this topic):
Software Version
DRIVE OS 6.0.10.0
DRIVE OS 6.0.8.1
DRIVE OS 6.0.6
DRIVE OS 6.0.5
DRIVE OS 6.0.4 (rev. 1)
DRIVE OS 6.0.4 SDK
other

Target Operating System
Linux
QNX
other

Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-300)
DRIVE AGX Orin Developer Kit (940-63710-0010-200)
DRIVE AGX Orin Developer Kit (940-63710-0010-100)
DRIVE AGX Orin Developer Kit (940-63710-0010-D00)
DRIVE AGX Orin Developer Kit (940-63710-0010-C00)
DRIVE AGX Orin Developer Kit (not sure its number)
other

SDK Manager Version
2.1.0
other

Host Machine Version
native Ubuntu Linux 20.04 Host installed with SDK Manager
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
native Ubuntu Linux 18.04 Host installed with DRIVE OS Docker Containers
other

Issue Description
I encountered a severe performance issue when converting a GPT-like model (with an `attention_mask` input) from ONNX to TensorRT.

After successfully converting and deploying the engine on **DriveOS (AGX Orin, TensorRT 8.6.15)**, I used **Nsight Systems (nsys)** to analyze runtime performance. Surprisingly, more than **90% of the total inference time** is spent on `attention_mask`-related operations, while the actual matrix multiplications and other Transformer layers take very little time.

This behavior seems abnormal; I suspect the mask computation is not being fused or optimized efficiently by TensorRT.

Environment

  • Platform: NVIDIA DRIVE AGX Orin
  • DriveOS Version: 6.12.1
  • TensorRT Version: 8.6.15
  • CUDA Version: 11.4
  • Model type: GPT-like transformer (causal decoder)
  • Inputs: token embeddings + attention_mask + past_kv_cache
  • Precision: FP16
  • Batch size: 1
  • ONNX opset: 16

Steps to Reproduce

  1. Export the model from PyTorch → ONNX (with the attention_mask input); a minimal export sketch is included after this list.
  2. Convert the ONNX model to a TensorRT engine:
/usr/src/tensorrt/bin/trtexec --onnx=test.onnx --saveEngine=test.engine --exportLayerInfo=test_layerinfo.log --profilingVerbosity=detailed --exportProfile=test_profile.log --separateProfileRun --duration=5 --streams=1 --useCudaGraph --fp16 --verbose
  3. Run inference and profile with Nsight Systems:
nsys profile --force-overwrite true -o test --trace=cuda,nvtx,cublas,cudnn --stats=true \
        /usr/src/tensorrt/bin/trtexec \
        --loadEngine=test.engine \
        --iterations=10 --idleTime=500 --duration=0 --useSpinWait
  4. Observe the runtime results: attention_mask-related kernels dominate total time (>90%).
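
For reference, here is a minimal, self-contained sketch of step 1 (a toy single-attention-layer module, not our real network; all names are illustrative). It only shows how a 0/1 attention_mask input typically enters the graph and why the exporter emits extra Cast/Expand/Sub/Mul glue nodes around it:

```python
# Toy causal-attention module (illustrative only, not our production model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, hidden_states, attention_mask):
        b, t, d = hidden_states.shape
        q, k, v = self.qkv(hidden_states).chunk(3, dim=-1)
        q = q.view(b, t, self.heads, -1).transpose(1, 2)
        k = k.view(b, t, self.heads, -1).transpose(1, 2)
        v = v.view(b, t, self.heads, -1).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / (d // self.heads) ** 0.5
        # The 0/1 attention_mask is turned into an additive bias here; this
        # cast/broadcast/sub/mul pattern shows up as "glue" nodes after export.
        bias = (1.0 - attention_mask[:, None, None, :].to(scores.dtype)) * -1e4
        out = F.softmax(scores + bias, dim=-1) @ v
        return self.proj(out.transpose(1, 2).reshape(b, t, d))

model = ToyAttention().eval()
hidden_states = torch.randn(1, 16, 64)
attention_mask = torch.ones(1, 16, dtype=torch.int64)
torch.onnx.export(
    model, (hidden_states, attention_mask), "toy_attention.onnx",
    input_names=["hidden_states", "attention_mask"],
    output_names=["output"],
    dynamic_axes={"hidden_states": {1: "seq"}, "attention_mask": {1: "seq"}},
    opset_version=16,
)
```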

Expected Behavior

attention_mask operations should be lightweight (simple broadcast or add ops) and not dominate runtime.

Because the attention_mask expands into many intermediate (glue) operators in the built engine, it appears to be one of the main bottlenecks during TensorRT inference.
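
To make the "many glue operators" point concrete, the node mix around the mask can be inspected with the standard onnx Python package (sketch only; the exact op set depends on how the mask is written in the PyTorch model):

```python
# Count the elementwise/shape "glue" ops in the exported graph.
import onnx
from collections import Counter

model = onnx.load("test.onnx")
op_counts = Counter(node.op_type for node in model.graph.node)
glue_ops = {"Cast", "Expand", "Where", "Sub", "Mul", "Unsqueeze",
            "Reshape", "Slice", "Concat"}
print({op: n for op, n in op_counts.items() if op in glue_ops})
```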


Attachments

I’ve attached:

  • test.onnx
  • TensorRT conversion log (test.txt)
  • Nsight Systems trace (test.nsys-rep)

Uploaded as: attention_mask_perf_issue.zip


Questions

  1. Is this a known issue or regression in TRT 8.6.15 on DriveOS?
  2. Are there recommended practices to optimize or fuse attention_mask handling (e.g., via a plugin or a model rewrite)?
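
Regarding question 2, one model-rewrite idea we are considering (a rough sketch only, not yet validated on Orin; the helper below is ours, not part of any library): precompute the additive mask bias on the host and feed it to the engine as an FP16 input, so each attention layer sees a single Add before the Softmax instead of the Cast/Expand/Sub/Mul glue around the 0/1 attention_mask.

```python
# Host-side helper (sketch): turn a (batch, seq) 0/1 attention_mask into an
# additive FP16 bias that can be passed directly as an engine input.
import torch

def make_additive_mask_bias(attention_mask: torch.Tensor,
                            dtype: torch.dtype = torch.float16) -> torch.Tensor:
    # 1 = attend, 0 = padded. Kept positions get bias 0, padded positions get
    # a large negative value; shape (batch, 1, 1, seq) for broadcasting.
    bias = (1.0 - attention_mask.to(dtype)) * torch.finfo(dtype).min
    return bias[:, None, None, :]

# Inside the re-exported model, the mask handling then collapses to:
#   scores = scores + mask_bias   # mask_bias is a plain graph input
```

If there is a more idiomatic way to get TensorRT's fused attention kernels to trigger for this pattern on Orin, pointers would be appreciated.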

Additional Notes

Note that the attached test.onnx contains only part of the full model. I also tried removing attention_mask and instead exposing past_key_values_input as a dynamic-shape input; however, after conversion the latency was still not ideal because of the dynamic input.
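
For reference, a sketch of how such an optimization profile can be specified (shown with the TensorRT Python API purely for illustration, assuming the bindings are available; the input name and dimensions below are placeholders, and trtexec offers the equivalent --minShapes/--optShapes/--maxShapes flags):

```python
# Sketch only: building with an explicit optimization profile for a dynamic
# past_key_values_input. Names and shapes are placeholders.
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("test.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

profile = builder.create_optimization_profile()
# Placeholder layout (layers, 2, heads, past_len, head_dim); past_len is dynamic.
profile.set_shape("past_key_values_input",
                  min=(1, 2, 12, 0, 64),
                  opt=(1, 2, 12, 128, 64),
                  max=(1, 2, 12, 512, 64))
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)
with open("test_dynamic.engine", "wb") as f:
    f.write(engine_bytes)
```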

Would really appreciate any feedback or workaround — this issue currently blocks our deployment. Thanks a lot for your time and support!
ONNX model, conversion log, and nsys trace download link:
attention_mask_perf_issue.zip
