Significant latency regression when compiling deformable attention layers in TensorRT 10.5

Description

There is a significant latency regression in a standard deformable attention layer when upgrading from TRT 8.6.1.6 to TRT 10.5.0.18.

Benchmarking engines compiled from the same ONNX model with trtexec shows mean GPU compute time rising from about 1.91 ms on TRT 8.6 to about 3.45 ms on TRT 10.5 (roughly a 1.8x slowdown).

TRT 8.6:

[I] GPU Compute Time: min = 1.84009 ms, max = 2.03879 ms, mean = 1.90564 ms, median = 1.90161 ms, percentile(90%) = 1.91577 ms, percentile(95%) = 1.92102 ms, percentile(99%) = 2.03264 ms
[I] Total GPU Compute Time: 1.84276 s

TRT 10.5:

[I] GPU Compute Time: min = 3.37921 ms, max = 3.78674 ms, mean = 3.44999 ms, median = 3.44165 ms, percentile(90%) = 3.52563 ms, percentile(95%) = 3.53076 ms, percentile(99%) = 3.55737 ms
[I] Total GPU Compute Time: 2.79104 s

Environment

TensorRT Version: 10.5.0.18
GPU Type: A4000
Nvidia Driver Version: 535.183.01
CUDA Version: 12.2
CUDNN Version: 8.9
Operating System + Version: Linux x86, Ubuntu 20.04
Python Version (if applicable): 3.10
TensorFlow Version (if applicable): N/A
PyTorch Version (if applicable): N/A
Baremetal or Container (if container which image + tag): N/A

Relevant Files

Simple Python file to reproduce our TensorRT compilation process: deformable_attn_compile.py · GitHub

ONNX & TRT engine files:
trt_bug_report2.zip (3.5 MB)

  • deformable_attention_10k_queries.onnx: Deformable attention layer ONNX model file. Opset 17, ONNX version 1.16.0
  • deformable_attention_10k_queries_trt105.trt: Deformable attention layer TRT engine for TRT 10.5.0.18.
  • deformable_attention_10k_queries_trt86.trt: Deformable attention layer TRT engine for TRT 8.6.1.6.

Steps To Reproduce

The provided Python file can be used to compile the ONNX model. It uses only the standard TensorRT Python package and has no dependencies beyond Python 3.10. We have already compiled engines for both TRT 10.5 and TRT 8.6 and included them above.

# The same script works with both the TRT 8.6 and TRT 10.5 Python packages.
python3 deformable_attn_compile.py
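For reference, below is a minimal sketch of the kind of build flow the script follows, assuming the usual ONNX parser + builder-config path; the file names and the 4 GiB workspace limit are placeholders, and the actual deformable_attn_compile.py may differ in details.

# Minimal sketch of an ONNX -> engine build (placeholder paths and workspace size).
import tensorrt as trt

ONNX_PATH = "deformable_attention_10k_queries.onnx"
ENGINE_PATH = "deformable_attention_10k_queries.trt"

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
# EXPLICIT_BATCH is required on TRT 8.6; on TRT 10.x it is deprecated and a no-op.
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open(ONNX_PATH, "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 << 30)  # 4 GiB workspace

# build_serialized_network is available in both TRT 8.6 and TRT 10.5.
serialized_engine = builder.build_serialized_network(network, config)
with open(ENGINE_PATH, "wb") as f:
    f.write(serialized_engine)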

Benchmarking with trtexec shows the slowdown.

# TRT 8.6 benchmark
trtexec --loadEngine=deformable_attention_10k_queries_trt86.trt


# TRT 10.5 benchmark
trtexec --loadEngine=deformable_attention_10k_queries_trt105.trt
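Per-layer timings, which may help pinpoint which layers or tactics regressed, can also be dumped with trtexec's standard profiling flags (the same flags should work in both versions):

# Optional: per-layer timing for the TRT 10.5 engine
trtexec --loadEngine=deformable_attention_10k_queries_trt105.trt --dumpProfile --separateProfileRun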

Hi @aboubezari,
Can you please share the ONNX model with us?

Thanks