I am using INT8 quantization with QDQ nodes in TensorRT, but I'm running into performance issues, especially around Resize operations.
I have already resolved most of the fusion problems around Add operations, but with Concat I can't get good fusions. I tried unifying the quantization scales of the Concat inputs, but saw no improvement. I also tried wrapping the Resize in Q/DQ nodes, but that didn't help either.
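For clarity, here is a minimal pure-Python sketch of what I mean by "unifying the scales": re-quantizing every Concat input to one shared scale, since (as I understand it) TensorRT can only keep a Concat in INT8 and avoid extra reformat/copy nodes when all its input Q/DQ scales match. The helper names here are mine for illustration, not TensorRT or ONNX API:

```python
def quantize(x, scale):
    """Symmetric INT8 quantization: q = clamp(round(x / scale), -128, 127)."""
    q = round(x / scale)
    return max(-128, min(127, q))

def dequantize(q, scale):
    """Recover the approximate real value from its INT8 code."""
    return q * scale

def unify_concat_scales(scales):
    # Hypothetical helper: pick the largest input scale so every branch's
    # dynamic range still fits without clipping, then reuse that scale on
    # all Q/DQ pairs feeding the Concat.
    return max(scales)

# Two branches feeding a Concat, each with its own calibrated scale:
branch_scales = [0.05, 0.02]
shared = unify_concat_scales(branch_scales)  # -> 0.05

# Re-quantize a value from branch 2's scale into the shared scale:
x = dequantize(quantize(1.0, branch_scales[1]), branch_scales[1])
q_shared = quantize(x, shared)
```

In the real model I applied the equivalent change by rewriting the scale initializers of the QuantizeLinear/DequantizeLinear nodes in the ONNX graph, but it made no measurable difference.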
Below I've attached some graphs comparing TRT INT8 acceleration without QDQ inserted versus with QDQ inserted. The Resize operation adds so much latency that the QDQ build ends up slower than FP16.
I’m very puzzled by this and would appreciate any advice! Thank you.
TensorRT Version: 8.6.1
GPU Type: 3060
CUDA Version: 12.2
[[ qdq-resize4-layertime ]] — the resize4 output feeding the Concat costs a lot of time
[[ qdq-resize4-doublecopy ]] — there are two copy nodes
[[ qdq-resize4-copy-tactcic ]] — the tactic looks normal
[[ qdq-resize4-copy-tactcic ]] — this tactic also looks normal, yet the final layer time is still high
[[ qdq-resize4-copy-layerinfo ]]
[[ qdq-resize4-onnx ]]
[[ qdq-resize4-torch ]]
[[ qdq-resize4-trt build info ]]
Below are screenshots of the model without QDQ:
[[ origin-resize4-layertime ]] — the layer time here is normal!
[[ origin-resize4-tactcic ]]
[[ origin-resize4-build-info ]]
[[ origin-resize4-onnx ]]