Optimizing Qwen2.5-Coder Throughput with NVIDIA TensorRT-LLM Lookahead Decoding

Originally published at: Optimizing Qwen2.5-Coder Throughput with NVIDIA TensorRT-LLM Lookahead Decoding | NVIDIA Technical Blog

Large language models (LLMs) that specialize in coding have been steadily adopted into developer workflows. From pair programming to self-improving AI agents, these models assist developers with various tasks, including enhancing code, fixing bugs, generating tests, and writing documentation. To promote the development of open-source LLMs, the Qwen team recently released Qwen2.5-Coder, a family of…

Hi, thank you very much for the interesting article. You mentioned that:

"Qwen2.5-Coder models optimized with TensorRT-LLM have also been packaged as downloadable NVIDIA NIM microservices for ease of deployment."

Unfortunately, I can’t find the qwen2.5-coder-32b-instruct model as a NIM that I can deploy locally on our H100 setup. Could you please advise? Thanks
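
For context, here is roughly how I was hoping to use it once deployed. Since NIM microservices expose an OpenAI-compatible API (on port 8000 by default), I'd expect something like the sketch below to work against a locally running container. The model identifier string is my assumption, since the published container is exactly what I can't locate:

```python
# Minimal sketch: querying a locally deployed NIM through its
# OpenAI-compatible endpoint. Assumes the container is already running
# and listening on localhost:8000 (the NIM default).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used",  # a local NIM deployment does not require a real key
)

# The model name below is an assumption; the exact identifier would come
# from the published NIM container, which is what I'm unable to find.
response = client.chat.completions.create(
    model="qwen/qwen2.5-coder-32b-instruct",
    messages=[
        {
            "role": "user",
            "content": "Write a Python function that checks whether a string is a palindrome.",
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```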