TensorRT-LLM Speculative Decoding Boosts Inference Throughput by up to 3.6x

Originally published at: TensorRT-LLM Speculative Decoding Boosts Inference Throughput by up to 3.6x | NVIDIA Technical Blog

NVIDIA TensorRT-LLM support for speculative decoding now provides more than a 3x speedup in total token throughput. TensorRT-LLM is an open-source library that provides blazing-fast inference support for numerous popular large language models (LLMs) on NVIDIA GPUs. By adding support for speculative decoding on single-GPU and single-node multi-GPU setups, the library further expands its supported optimizations…

Leverage TensorRT-LLM to unlock up to 3.6x higher inference throughput on large language models, as described in this post. If you have any questions or comments, please let us know!

Could you clarify why the “Run decoding” step for 405B is run with “mpirun -n 8”? The setup described above uses TP=4, so why does “run.py” need to be launched with 8 processes?

Yes, that’s a typo, thanks @chislett.ben! We’ll have it updated.
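
For reference, a minimal sketch of what the corrected invocation might look like for a TP=4 engine. The engine and tokenizer paths below are placeholders, not the exact paths from the post; the key point is that the MPI world size passed to `mpirun -n` should match the tensor-parallel degree the engine was built with.

```bash
# Illustrative sketch only: with a TP=4 engine, launch 4 ranks (not 8),
# one per GPU participating in tensor parallelism.
# Paths are hypothetical; substitute the engine and tokenizer directories
# produced by the setup steps in the post.
mpirun -n 4 \
    python examples/run.py \
        --engine_dir ./llama-3.1-405b_tp4_engine \
        --tokenizer_dir ./llama-3.1-405b \
        --max_output_len 128 \
        --input_text "What is speculative decoding?"
```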