Optimizing Qwen2.5-Coder Throughput with NVIDIA TensorRT-LLM Lookahead Decoding

Originally published at: Optimizing Qwen2.5-Coder Throughput with NVIDIA TensorRT-LLM Lookahead Decoding | NVIDIA Technical Blog

Large language models (LLMs) that specialize in coding have been steadily adopted into developer workflows. From pair programming to self-improving AI agents, these models assist developers with various tasks, including enhancing code, fixing bugs, generating tests, and writing documentation. To promote the development of open-source LLMs, the Qwen team recently released Qwen2.5-Coder, a family of…

Hi, thank you very much for the interesting article. You mentioned that:

"Qwen2.5-Coder models optimized with TensorRT-LLM have also been packaged as downloadable NVIDIA NIM microservices for ease of deployment."

Unfortunately, I can’t find the qwen2.5-coder-32b-instruct model as a NIM that I can deploy locally on our H100 setup. Could you please advise? Thanks
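
For context, here is roughly how I was hoping to use it once deployed. Since NIM microservices expose an OpenAI-compatible API (on port 8000 by default), I'd expect something like the sketch below to work against a locally running container. The model identifier string is my assumption, since the published container is exactly what I can't locate:

```python
# Minimal sketch: querying a locally deployed NIM through its
# OpenAI-compatible endpoint. Assumes the container is already running
# and listening on localhost:8000 (the NIM default).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used",  # a local NIM deployment does not require a real key
)

# The model name below is an assumption; the exact identifier would come
# from the published NIM container, which is what I'm unable to find.
response = client.chat.completions.create(
    model="qwen/qwen2.5-coder-32b-instruct",
    messages=[
        {
            "role": "user",
            "content": "Write a Python function that checks whether a string is a palindrome.",
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```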