Boosting LLM Inference Speed Using Speculative Decoding in MLC-LLM on Nvidia Jetson AGX Orin

Greetings to all,

Below is a guide on how to speed up token generation during the decoding phase using speculative decoding, where a small draft model proposes tokens that the larger target model then verifies:

  • Target model: Llama-3.1-8B-Instruct-q0f16-MLC
  • Draft model: Llama-3.2-1B-Instruct-q4f16_1-MLC

Demo without speculative decoding

mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q0f16-MLC \
    --mode server \
    --host 0.0.0.0 \
    --port 5000

Demo with speculative decoding

mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q0f16-MLC \
    --additional-models HF://mlc-ai/Llama-3.2-1B-Instruct-q4f16_1-MLC \
    --speculative-mode "small_draft" \
    --overrides "max_num_sequence=6" \
    --mode server \
    --host 0.0.0.0 \
    --port 5000
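
To compare the two setups, you can send the same request to each server and measure generation speed. The sketch below assumes the OpenAI-compatible /v1/chat/completions endpoint that mlc_llm serve exposes; the "model" field value is an assumption here and may need to match the model id your server actually reports (you can check with GET /v1/models if in doubt).

# Sample request against either server; time it (or stream it) to compare
# decode throughput with and without the draft model.
curl -s http://0.0.0.0:5000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "HF://mlc-ai/Llama-3.1-8B-Instruct-q0f16-MLC",
          "messages": [
            {"role": "user", "content": "Explain speculative decoding in two sentences."}
          ],
          "max_tokens": 256,
          "stream": false
        }'

Because the draft model proposes several tokens per step and the target model only verifies them, decode tokens/sec should improve while the output quality remains that of the target model.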

Just FYI @dusty_nv!
