Originally published at: Boost Llama 3.3 70B Inference Throughput 3x with NVIDIA TensorRT-LLM Speculative Decoding | NVIDIA Technical Blog
Meta’s Llama collection of open large language models (LLMs) continues to grow with the recent addition of Llama 3.3 70B, a text-only instruction-tuned model. Llama 3.3 provides enhanced performance relative to the older Llama 3.1 70B model and can even match the capabilities of the larger, more computationally expensive Llama 3.1 405B model on several tasks including…
@jwitsoe, could you please share the steps to benchmark the target engine (built following the steps in the blog)? The blog omits the steps for benchmarking the target engine, and we need them to compare and validate our results. Your response would be greatly appreciated.