What is the best way to convert a combined encoder-decoder transformer model to TensorRT? The model is used by calling model.encode() and model.decode() separately, which differs from the single forward() pass that the conversion tooling typically supports. Also, once the model is passed through torch_tensorrt.compile, what is the expected type of the converted model? What should we do if the converted model actually runs slower than the original unconverted model? And is there a Docker container image for Jetson that includes TensorRT, JetPack 6.2, and TensorRT-LLM?
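For context, here is roughly what I am attempting: a minimal sketch that wraps the encode()/decode() calls in separate modules whose forward() dispatches to them, so each half can be compiled on its own. The toy model, the decode(tgt, memory) signature, and all shapes/dtypes below are placeholders for my actual model, not part of any real API:

```python
import torch
import torch_tensorrt

# Toy stand-in for the real model: any module exposing encode()/decode().
class ToyEncoderDecoder(torch.nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_model)
        self.dec = torch.nn.Linear(d_model * 2, d_model)

    def encode(self, src):
        return self.enc(src)

    def decode(self, tgt, memory):
        return self.dec(torch.cat([tgt, memory], dim=-1))

# Wrappers expose encode()/decode() as forward() so torch_tensorrt.compile
# can trace each half of the model as an ordinary module.
class EncoderWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, src):
        return self.model.encode(src)

class DecoderWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, tgt, memory):
        return self.model.decode(tgt, memory)

model = ToyEncoderDecoder().eval().cuda()

# Compile each half separately; shapes here are illustrative only.
enc_trt = torch_tensorrt.compile(
    EncoderWrapper(model).eval().cuda(),
    inputs=[torch_tensorrt.Input((1, 16, 64), dtype=torch.float32)],
    enabled_precisions={torch.float16},
)
dec_trt = torch_tensorrt.compile(
    DecoderWrapper(model).eval().cuda(),
    inputs=[
        torch_tensorrt.Input((1, 16, 64), dtype=torch.float32),
        torch_tensorrt.Input((1, 16, 64), dtype=torch.float32),
    ],
    enabled_precisions={torch.float16},
)
```

Is splitting the model like this the recommended approach, or is there a better pattern for non-forward() entry points?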
A few follow-up questions:
- Are there strategies to reduce memory usage during conversion, such as incremental optimization?
- Do dynamic shapes affect optimization performance? (A sketch of the dynamic-shape spec I have in mind follows this list.)
- What are the best practices for custom operations, such as a specialized patch embedding layer or rotary positional encoding?
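On the dynamic-shapes question, this is how I understand the range specification works: a sketch assuming torch_tensorrt.Input accepts min/opt/max shapes, with sequence-length ranges made up purely for illustration:

```python
import torch
import torch_tensorrt

# Assumed: batch fixed at 1, sequence length varying from 16 to 512 positions
# of a 64-dim embedding; these ranges are illustrative only.
dynamic_input = torch_tensorrt.Input(
    min_shape=(1, 16, 64),   # smallest shape the engine must handle
    opt_shape=(1, 128, 64),  # shape the engine is tuned for
    max_shape=(1, 512, 64),  # largest shape the engine must handle
    dtype=torch.float32,
)

# Reusing EncoderWrapper/model from the sketch above.
enc_trt = torch_tensorrt.compile(
    EncoderWrapper(model).eval().cuda(),
    inputs=[dynamic_input],
    enabled_precisions={torch.float16},
)
```

If specifying wide min/max ranges like this costs noticeable performance compared to fixed shapes, I would like to know before committing to it.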