NVIDIA TensorRT-LLM Now Supports Recurrent Drafting for Optimizing LLM Inference

jwitsoe · December 18, 2024, 5:31pm

Originally published at: NVIDIA TensorRT-LLM Now Supports Recurrent Drafting for Optimizing LLM Inference | NVIDIA Technical Blog

Recurrent drafting (referred to as ReDrafter) is a novel speculative decoding technique for large language model (LLM) inference, developed and open-sourced by Apple and now available with NVIDIA TensorRT-LLM. ReDrafter helps developers significantly boost LLM workload performance on NVIDIA GPUs. NVIDIA TensorRT-LLM is a library for optimizing LLM inference. It provides an easy-to-use Python API to define…
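For readers new to speculative decoding, the general draft-then-verify idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not ReDrafter's recurrent drafter and not the TensorRT-LLM API: `target_model` and `draft_model` are hypothetical stand-in functions, decoding is greedy, and verification is done one position at a time, whereas a real system would score all drafted positions in a single batched pass of the large model.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to the next token (greedy).
Model = Callable[[List[int]], int]


def target_model(prefix: List[int]) -> int:
    # Hypothetical "large" model: a fixed deterministic rule standing in for an LLM.
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_model(prefix: List[int]) -> int:
    # Hypothetical cheap drafter: agrees with the target except when
    # the prefix length is a multiple of 4, where it guesses wrong.
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50


def greedy_decode(model: Model, prompt: List[int], max_new_tokens: int) -> List[int]:
    # Baseline: one model call per generated token.
    seq = list(prompt)
    for _ in range(max_new_tokens):
        seq.append(model(seq))
    return seq


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new_tokens: int, draft_len: int = 4) -> List[int]:
    seq = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(seq) < goal:
        # 1. Draft: the cheap model proposes draft_len tokens autoregressively.
        proposal: List[int] = []
        for _ in range(draft_len):
            proposal.append(draft(seq + proposal))

        # 2. Verify: keep drafted tokens while they match the target's greedy choice.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch, discard the rest
                break
            accepted.append(tok)
        else:
            # Every drafted token was accepted, so the target adds one more token.
            accepted.append(target(seq + accepted))

        seq.extend(accepted)
    return seq[:goal]


prompt = [1, 2, 3]
baseline = greedy_decode(target_model, prompt, 20)
speculative = speculative_decode(target_model, draft_model, prompt, 20)
assert speculative == baseline  # same tokens as plain greedy decoding of the target
```

Every token that is kept is one the target model would have chosen itself, so in this greedy setting the output is identical to ordinary decoding. Any speedup comes from accepting several drafted tokens per verification step.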

Topic	Replies	Views	Activity
NVIDIA TensorRT-LLM Now Accelerates Encoder-Decoder Models with In-Flight Batching Technical Blog llama	1	24	December 11, 2024
NVIDIA TensorRT-LLM Accelerates Encoder-Decoder Models with In-Flight Batching Technical Blog - South Korea llama	1	20	December 13, 2024
NVIDIA open sources parsers and plugins in TensorRT Technical Blog	0	264	August 21, 2022
TensorRT 4 Accelerates Neural Machine Translation, Recommenders, and Speech Technical Blog	0	380	August 25, 2020
Video: Introduction to Recurrent Neural Networks in TensorRT Technical Blog	1	379	January 5, 2020
TensorRT 3: Faster TensorFlow Inference and Volta Support Technical Blog	0	258	August 21, 2022
TensorRT-LLM Speculative Decoding Boosts Inference Throughput by up to 3.6x Technical Blog	4	88	January 9, 2025
Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available Technical Blog	8	1709	January 25, 2024
NVIDIA TensorRT-LLM Supercharges Large Language Model Inference on NVIDIA H100 GPUs Technical Blog - South Korea korean	0	615	September 22, 2023
Get Started with Generative AI Development for Windows PCs with NVIDIA RTX Technical Blog	8	734	March 21, 2024
