Description
Hi, I tried to convert an ONNX graph exported from the fairseq transformer decoder with incremental decoding. When building the engine, I got the warning “Myelin graph with multiple dynamic values may have poor performance if they differ”. At runtime, I benchmarked the engine against LibTorch. The TensorRT engine was on par with LibTorch at the optimal dimension of the optimization profile I used, and was slower than LibTorch by 15% at other input dimensions.
Is this caused by TensorRT's internal implementation? For a transformer decoder without incremental decoding, there can be just a single input with a dynamic time dimension, which should be fine. With incremental decoding, however, the incremental decoding cache also has to be passed as input. Each cache tensor has a dynamic time dimension too, and the number of such inputs grows with the number of transformer layers in the model. I guess that is the culprit behind the bad performance. Any suggestions for dealing with such a graph? Does TensorRT plan to support faster inference with multiple dynamic inputs in the future?
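
To make the input layout concrete, here is a minimal sketch in plain Python of the kind of (min, opt, max) shape table such a decoder needs. Everything in it is an illustrative assumption rather than something taken from the actual model: the function name `decoder_profile_shapes`, the input names (`prev_tokens`, `cache_<layer>_key`, `cache_<layer>_value`), the cache layout (batch, heads, time, head_dim), and the restriction to self-attention caches only.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]
ShapeRange = Tuple[Shape, Shape, Shape]  # (min, opt, max)


def decoder_profile_shapes(
    num_layers: int,
    num_heads: int,
    head_dim: int,
    batch: int,
    min_t: int,
    opt_t: int,
    max_t: int,
) -> Dict[str, ShapeRange]:
    """Build a (min, opt, max) shape table for a hypothetical incremental decoder.

    Assumed layout (not confirmed by the original model):
      - one token input of shape (batch, 1), fixed across steps
      - per layer, a self-attention key cache and value cache of shape
        (batch, num_heads, time, head_dim), where only `time` is dynamic
    """
    shapes: Dict[str, ShapeRange] = {}

    # With incremental decoding only the newest token is fed in, so this
    # input has the same shape for min, opt, and max.
    token_shape: Shape = (batch, 1)
    shapes["prev_tokens"] = (token_shape, token_shape, token_shape)

    # Every layer contributes two more inputs whose time dimension is dynamic.
    for layer in range(num_layers):
        for kind in ("key", "value"):
            name = f"cache_{layer}_{kind}"
            lo, opt, hi = (
                (batch, num_heads, t, head_dim) for t in (min_t, opt_t, max_t)
            )
            shapes[name] = (lo, opt, hi)

    return shapes


if __name__ == "__main__":
    table = decoder_profile_shapes(
        num_layers=6, num_heads=8, head_dim=64,
        batch=1, min_t=1, opt_t=32, max_t=128,
    )
    dynamic = [n for n, (lo, _, hi) in table.items() if lo != hi]
    print(f"{len(table)} inputs, {len(dynamic)} with a dynamic time dimension")
```

With the TensorRT Python API, each entry of such a table would typically be passed to `IOptimizationProfile.set_shape(name, min, opt, max)`. In this sketch a 6-layer decoder already yields 12 dynamic inputs, all of which share the same time value at each decoding step, even though the builder treats them as independent dynamic dimensions.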