Why do Resize and Concat cause high latency with QDQ INT8 quantization in TensorRT?


Dear community,

I am using INT8 quantization with Q/DQ nodes in TensorRT but am running into performance problems, especially around Resize operations.

I have already resolved most of the fusion problems around Add operations. However, when it comes to Concat, I'm unable to get good fusions. I tried unifying the quantization scales of the Concat inputs, but saw no improvement. I also tried wrapping the Resize in a Q/DQ pair, but that didn't help either.
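For reference, here is a minimal pure-Python sketch of the scale-unification idea I tried. The function names are my own (they mimic ONNX QuantizeLinear/DequantizeLinear, not a TensorRT API), and the sample values and the max-scale policy are just illustrative assumptions:

```python
def quantize(x, scale, qmin=-128, qmax=127):
    """Symmetric per-tensor INT8 quantization (like ONNX QuantizeLinear)."""
    return max(qmin, min(qmax, round(x / scale)))

def dequantize(q, scale):
    """Like ONNX DequantizeLinear."""
    return q * scale

# Two Concat inputs quantized with different scales: the engine has to
# requantize (copy) one branch into the other branch's INT8 domain
# before it can concatenate the quantized buffers.
scale_a, scale_b = 0.02, 0.05
a_q = [quantize(v, scale_a) for v in (0.5, -1.0, 1.3)]
b_q = [quantize(v, scale_b) for v in (0.5, -1.0, 1.3)]

# Unifying the scales (here: taking the max of the two input scales)
# puts both branches in the same INT8 domain, so in principle Concat
# can operate on the quantized buffers without an extra copy.
shared = max(scale_a, scale_b)
a_u = [quantize(v, shared) for v in (0.5, -1.0, 1.3)]
b_u = [quantize(v, shared) for v in (0.5, -1.0, 1.3)]
assert a_u == b_u  # same values, same scale -> identical INT8 codes
```

This is what I expected the graph optimizer to exploit after I unified the scales, but the extra copy around resize4 remained.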

Below I’ve attached some graphs comparing TensorRT INT8 acceleration without inserted QDQ nodes versus with them. The Resize operation introduces so much latency that the QDQ INT8 engine is actually slower than FP16.

I’m very puzzled by this and would appreciate any advice! Thank you.


TensorRT Version: 8.6.1
GPU Type: 3060
CUDA Version: 12.2
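In case it helps reproduction, per-layer timings like the ones shown below can be collected with trtexec (the model path and output file name are placeholders, not my actual files):

```shell
# Build an INT8 engine from the QDQ ONNX model and dump per-layer timings.
trtexec --onnx=model.onnx --int8 \
        --dumpProfile --separateProfileRun \
        --profilingVerbosity=detailed \
        --exportLayerInfo=layerinfo.json
```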

[[ qdq-resize4-layertime ]] — the copy from the resize4 output to Concat costs a lot of time

[[ qdq-resize4-doublecopy ]] — there are two copy nodes

[[ qdq-resize4-copy-tactcic ]] — the chosen tactic looks normal

[[ qdq-resize4-copy-tactcic ]] — this tactic looks normal too, but the final copy still costs a lot of time

[[ qdq-resize4-copy-layerinfo ]]

[[ qdq-resize4-onnx ]]

[[ qdq-resize4-torch ]]

[[ qdq-resize4-trt build info ]]

Below are pictures of the model without QDQ:

[[ origin-resize4-layertime ]] — layer times are normal!

[[ origin-resize4-layerinfo ]]

[[ origin-resize4-tactcic ]]

[[ origin-resize4-build-info ]]

[[ origin-resize4-onnx ]]

Hi @FusionYu,
Could you please share the repro steps and the model with us, so that we can try it on our end and assist you better?