Poor inference performance of Transformer decoder with incremental decoding

Description

Hi, I tried to convert an ONNX graph exported from a fairseq Transformer decoder with incremental decoding. When building the engine, I got the warning "Myelin graph with multiple dynamic values may have poor performance if they differ". At runtime, I benchmarked the engine against LibTorch. The TensorRT engine was on par with LibTorch at the optimal dimension of the profile I used, but was about 15% slower at other input dimensions.

Is this about TensorRT's internal implementation? For a Transformer decoder without incremental decoding, there can be a single input whose time dimension is dynamic, which should be fine. With incremental decoding, however, the incremental decoding cache must also be passed as inputs: each cache tensor has a dynamic time dimension, and the number of such inputs grows with the number of Transformer layers in the model. I suspect that is the culprit behind the poor performance. Any suggestions for dealing with such a graph? Does TensorRT plan to support faster inference with multiple dynamic inputs in the future?
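To illustrate what I mean, here is a minimal sketch of the dynamic axes such an export ends up with. The input names, cache shape layout, and layer count are hypothetical, not fairseq's actual export code:

```python
# Hypothetical input layout for a decoder exported with incremental decoding:
# besides the token input, every per-layer key/value cache tensor becomes a
# graph input with its own dynamic time dimension.
num_layers = 6  # illustrative; depends on the model

dynamic_axes = {
    # the single dynamic input a non-incremental decoder would have
    "prev_output_tokens": {0: "batch", 1: "tgt_len"},
}
for layer in range(num_layers):
    # caches grow along the time axis (axis 2 here, assuming a
    # (batch, heads, time, head_dim) layout) at every decoding step
    dynamic_axes[f"cache_k_{layer}"] = {0: "batch", 2: "cached_len"}
    dynamic_axes[f"cache_v_{layer}"] = {0: "batch", 2: "cached_len"}

# this dict would be passed as torch.onnx.export(..., dynamic_axes=dynamic_axes)
print(len(dynamic_axes), "graph inputs with dynamic dimensions")
```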

Hi,

This is a known issue related to the internal implementation, and it may be fixed in a future release.
As a workaround, you could build a separate optimization profile for each important input dimension.
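For example, here is a minimal sketch of that workaround using the TensorRT Python builder API. The ONNX file name, the length ranges, and the assumption that only the time axis is dynamic (batch fixed at export) are all illustrative:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
with open("decoder.onnx", "rb") as f:  # hypothetical path
    parser.parse(f.read())

config = builder.create_builder_config()

# One profile per important length range, e.g. short/medium/long sequences,
# so each range runs close to its own "opt" shape.
for min_len, opt_len, max_len in [(1, 8, 16), (16, 32, 64), (64, 96, 128)]:
    profile = builder.create_optimization_profile()
    for i in range(network.num_inputs):
        inp = network.get_input(i)
        dims = list(inp.shape)
        # Replace each dynamic (-1) time axis with the profile's range;
        # static axes keep their exported sizes.
        mins = [min_len if d == -1 else d for d in dims]
        opts = [opt_len if d == -1 else d for d in dims]
        maxs = [max_len if d == -1 else d for d in dims]
        profile.set_shape(inp.name, mins, opts, maxs)
    config.add_optimization_profile(profile)

engine = builder.build_serialized_network(network, config)
```

At runtime, select the profile whose range covers the current sequence length (for example via IExecutionContext.set_optimization_profile_async) before enqueueing, so each decoding step executes near an optimal shape.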

Thank you
