NVIDIA TensorRT-LLM Now Accelerates Encoder-Decoder Models with In-Flight Batching

Originally published at: NVIDIA TensorRT-LLM Now Accelerates Encoder-Decoder Models with In-Flight Batching | NVIDIA Technical Blog

NVIDIA recently announced that NVIDIA TensorRT-LLM now accelerates encoder-decoder model architectures. TensorRT-LLM is an open-source library that optimizes inference for diverse model architectures, including the following:

- Decoder-only models, such as Llama 3.1
- Mixture-of-experts (MoE) models, such as Mixtral
- Selective state-space models (SSMs), such as Mamba
- Multimodal models for vision-language and video-language applications

The addition of…