Where does TensorRT determine whether to fuse Q/DQ with a plugin or not?

Description

Thanks for your help in advance!

I wrote a custom plugin to support int8 input, but the log of trtexec --verbose shows that a dq->my_op->q combination, which I expected to be fused into a my_op_int, is not fused. The toy ONNX model and the TRT model after fusion are shown below.


It is said that custom plugins can’t be fused with other operators. I am wondering whether a custom plugin can be fused with Q/DQ to construct a custom int8 operator. If so, how can I create such a custom Q/DQ fusion? Or is there any way to tell TensorRT whether or not to fuse a Q/DQ operator with a custom plugin?

Thank you!

Hi,
Please refer to the links below for custom plugin implementation and samples:

While the IPluginV2 and IPluginV2Ext interfaces are still supported for backward compatibility with TensorRT 5.1 and 6.0.x respectively, we recommend that you write new plugins or refactor existing ones to target the IPluginV2DynamicExt or IPluginV2IOExt interfaces instead.

Thanks!

Hi,
Thanks for your reply! It seems that neither reference is related to Q/DQ implementation.

My custom plugin is built on IPluginV2DynamicExt and has been tested with a Python script (including a self-defined dynamic range, as mentioned in ref1), following the examples in trt-samples-for-hackathon-cn/cookbook/05-Plugin at master · NVIDIA/trt-samples-for-hackathon-cn · GitHub. It seems that the QDQ example in that repo is still labeled TODO.
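For reference, the test roughly follows the pattern sketched below; the plugin name/version ("IntSoftmax", "1"), the input shape, and the dynamic-range values are placeholders for illustration, not the exact script.

```python
# Minimal sketch of exercising a plugin's int8 path with hand-set dynamic ranges.
# Assumed names: IntSoftmaxPlugin.so, plugin ("IntSoftmax", "1"), placeholder shape.
import ctypes
import tensorrt as trt

ctypes.CDLL("./IntSoftmaxPlugin.so")                 # load the custom plugin library
logger = trt.Logger(trt.Logger.VERBOSE)
trt.init_libnvinfer_plugins(logger, "")

builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
inp = network.add_input("x", trt.float32, (1, 8, 197, 197))   # placeholder shape

creator = trt.get_plugin_registry().get_plugin_creator("IntSoftmax", "1")
plugin = creator.create_plugin("int_softmax", trt.PluginFieldCollection([]))
layer = network.add_plugin_v2([inp], plugin)
network.mark_output(layer.get_output(0))

# Hand-written dynamic ranges stand in for calibration / Q-DQ scales.
inp.set_dynamic_range(-4.0, 4.0)
layer.get_output(0).set_dynamic_range(-1.0, 1.0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)
engine = builder.build_serialized_network(network, config)
assert engine is not None
```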

Is there any other examples for implementing custom plugins with Q-DQ?

Plugins in TRT replace a group of layers with a proprietary implementation. You, the user, decide what functionality to include in the plugin and what to leave for TRT to handle.

The same idea follows for a TRT network with Q/DQ layers: if you want your plugin to get quantized int8 inputs and outputs, then you need to include the input DQ and output Q as part of your plugin and remove them from the network you define for TRT.

Let’s look at your example (ignoring weights quantization):
Q → DQ → Conv → Q → DQ → IntSoftMax → Q → DQ → Conv.

I’ll indicate TRT fusions using square brackets and show what you are getting today:
Q → [DQ → Conv → Q] → DQ_i → IntSoftMax → Q_o → [DQ → Conv].

I’m suggesting that you manually “fuse” DQ_i and Q_o with IntSoftMax so TRT will see a network like this:
Q → DQ → Conv → Q → IntSoftMax → DQ → Conv
Which it will fuse to:
Q → [DQ → Conv → Q] → IntSoftMax → [DQ → Conv].

When you “manually fuse” DQ_i you take the input quantization scale and give it to your plugin so it will know how to dequantize (if needed) the input. The same follows for using the scale from Q_o in order to quantize your plugin’s output.
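For example, this graph edit can be scripted with onnx-graphsurgeon. A rough sketch is below; the op type "IntSoftmax", the file names, and the "input_scale"/"output_scale" attribute names (which your plugin creator would read as plugin fields) are illustrative assumptions, not your actual plugin.

```python
# Sketch: fold DQ_i / Q_o into the custom op by removing them from the graph and
# handing their scales to the op as attributes. Assumed op/attr/file names.
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("toy_qdq.onnx"))           # assumed file name

for node in graph.nodes:
    if node.op != "IntSoftmax":                             # assumed custom op type
        continue

    dq_i = node.i(0)                                        # DequantizeLinear feeding the op
    q_o = node.o(0)                                         # QuantizeLinear consuming the op
    assert dq_i.op == "DequantizeLinear" and q_o.op == "QuantizeLinear"

    # Keep the int8 tensor produced by the preceding Q as the op's direct input,
    # and expose DQ_i's scale to the plugin as an attribute.
    node.attrs["input_scale"] = float(dq_i.inputs[1].values)
    node.inputs[0] = dq_i.inputs[0]

    # Same on the output side: the op now produces Q_o's int8 output itself.
    node.attrs["output_scale"] = float(q_o.inputs[1].values)
    node.outputs = q_o.outputs
    q_o.outputs.clear()

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "toy_qdq_fused.onnx")
```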


Thanks for your reply!
We tested this manual fusion method and it works.

Hi,
Although the manual fusion method works well on the toy model above, another bug appeared when we applied it to DeiT.

The DeiT model with the inserted QDQ modules can be built successfully by trtexec. However, when we manually replace the ONNX Softmax with our custom int-softmax and manually remove the DQ and Q before and after the custom int-softmax, the engine fails to build.

The above image shows a typical multi-head-attention module in a Transformer with QDQ inserted. The Softmax in every encoder layer is replaced by the custom int-softmax.

The reported error is shown above. IntSoftmax_79 is the softmax in the first layer, and QuantizeLinear_75 is the input quantizer of IntSoftmax_79. 1238 is an attribute of the gather (indices=2) in the first layer, and MatMul_218 is the matmul before Q->IntSoftmax in the second layer.

It is confusing that such a huge structure (1238…MatMul_218] is regarded as a single foreign node, which means that all DQ nodes inside it are ignored.

The DeiT model with Q/DQ is attached below. Do you have any suggestions for solving this problem?

Thanks a lot!

qDeiT.onnx (22.0 MB)

Hi @jiangstein,
Your ONNX model looks good to me and this looks like a bug in TensorRT.
I’ve tested your model with an internal dev version of TensorRT and it builds fine.
Which TensorRT version and hardware are you using?
Neta

Hi,
Thanks for your reply! The ONNX model above was only meant to show that the DeiT model without the custom plugin and Q/DQ modification can be built correctly. The problem I met is described in the text above.

The modified ONNX model and the code for building the custom IntSoftmax plugin are packaged together in the following .tar file.

IntSoftmaxPlugin.tar (22.1 MB)

IntSoftmaxPlugin.so can be built directly with make using TRT 8.4, and the command for running trtexec is shown below:

trtexec --onnx=qDeiT_intsoftmax_modified.onnx --int8 --fp16 --saveEngine=qsoftmax.onnx.engine --plugins=./IntSoftmaxPlugin.so --verbose --nvtxMode=verbose --timingCacheFile="./timing.cache"
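For reference, a rough Python equivalent of this trtexec command is sketched below (file names taken from this post), which can help show whether the failure occurs while parsing the ONNX model or while building the engine.

```python
# Sketch mirroring the trtexec flags above: load the plugin, parse the ONNX model,
# build with INT8 + FP16, and save the engine. File names are from this post.
import ctypes
import tensorrt as trt

ctypes.CDLL("./IntSoftmaxPlugin.so")
logger = trt.Logger(trt.Logger.VERBOSE)
trt.init_libnvinfer_plugins(logger, "")

builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("qDeiT_intsoftmax_modified.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)
engine = builder.build_serialized_network(network, config)
assert engine is not None
with open("qsoftmax.onnx.engine", "wb") as f:
    f.write(engine)
```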

Could you please test this model? The problem I met should be easy to reproduce.

Thanks a lot for your sincere help!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.