Poor inference performance of Transformer decoder with incremental decoding

Description

Hi, I tried to convert an ONNX graph exported from a fairseq Transformer decoder with incremental decoding. I got the warning “Myelin graph with multiple dynamic values may have poor performance if they differ” when building the engine. At runtime, I benchmarked the engine against LibTorch. The TensorRT engine was on par with LibTorch at the optimal dimension of the profile I used, but was about 15% slower at other input dimensions.

Is this caused by TensorRT’s internal implementation? For a Transformer decoder without incremental decoding, there can be just a single input with a dynamic time dimension, which should be fine. But with incremental decoding, the incremental decoding cache also has to be an engine input: each cache tensor has a dynamic time dimension as well, and the number of such inputs grows with the number of Transformer layers in the model. I suspect that is the culprit behind the poor performance. Any suggestions for dealing with such a graph? Does TensorRT plan to support faster inference with multiple dynamic inputs in the future?
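For context, here is a minimal sketch of why the cache multiplies the dynamic inputs at export time. The input names, layer count, and cache layout below are hypothetical, not fairseq’s actual export code; they just show the `dynamic_axes` bookkeeping such a decoder needs:

```python
num_layers = 6  # hypothetical; pick your model's actual depth

# Besides the current target token, every decoder layer's self-attention
# cache (keys/values accumulated over past steps) becomes a graph input.
input_names = ["prev_token"]
dynamic_axes = {"prev_token": {0: "batch"}}
for i in range(num_layers):
    for kv in ("key", "value"):
        name = f"cache_l{i}_{kv}"  # hypothetical naming scheme
        input_names.append(name)
        # The cache's time dimension grows by one per decoding step,
        # so each of these inputs carries its own dynamic dimension.
        dynamic_axes[name] = {0: "batch", 2: "past_len"}

print(f"{len(input_names)} inputs, all with dynamic dimensions")
# The real export would then be something like:
# torch.onnx.export(decoder, example_inputs, "decoder.onnx",
#                   input_names=input_names, dynamic_axes=dynamic_axes)
```

With 6 layers this already yields 13 dynamic inputs, which is exactly the situation the Myelin warning is about.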

Hi,

This is a known issue related to the implementation. It may be fixed in a future release.
As a workaround, you could build a separate optimization profile for each important dimension.
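For example, a minimal sketch of that workaround with the TensorRT 8.x Python API. The input names "tokens" and "cache_l0_key" and their shapes are placeholders for your real graph; in practice every cache input needs its own set_shape call in each profile:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("decoder.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()

# One profile per decoding-length range you care about, each tuned so
# its "opt" shape matches a dimension you actually run at.
for opt_len in (8, 32, 128):
    profile = builder.create_optimization_profile()
    profile.set_shape("tokens",
                      min=(1, 1), opt=(1, opt_len), max=(1, 256))
    profile.set_shape("cache_l0_key",
                      min=(1, 8, 0, 64), opt=(1, 8, opt_len, 64),
                      max=(1, 8, 256, 64))
    config.add_optimization_profile(profile)

engine = builder.build_serialized_network(network, config)
```

At runtime, select the profile whose opt shape is closest to the current sequence length (e.g. via context.set_optimization_profile_async) before setting the input shapes.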

Thank you


Hi,
Are there any updates on this?
I have a similar issue. When using optimization profiles with multiple dynamic values, this warning shows up. In addition, from both a memory allocation and a performance point of view, the converted models are not efficient at all (the ONNX model counterpart is more efficient).