Boosting LLM Inference Speed Using Speculative Decoding in MLC-LLM on Nvidia Jetson AGX Orin

Greetings to all,

Below is a guide on how to speed up token generation during the decoding phase using speculative decoding, where a small draft model proposes tokens that the larger target model then verifies:

  • Target model: Llama-3.1-8B-Instruct-q0f16-MLC
  • Draft model: Llama-3.2-1B-Instruct-q4f16_1-MLC

Demo without speculative decoding

mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q0f16-MLC \
    --mode server \
    --host 0.0.0.0 \
    --port 5000

Demo with speculative decoding

mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q0f16-MLC \
    --additional-models HF://mlc-ai/Llama-3.2-1B-Instruct-q4f16_1-MLC \
    --speculative-mode "small_draft" \
    --overrides "max_num_sequence=6" \
    --mode server \
    --host 0.0.0.0 \
    --port 5000
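
To compare the two setups, you can send the same request to each server and measure generation speed. The sketch below assumes the OpenAI-compatible /v1/chat/completions endpoint that mlc_llm serve exposes; the "model" field value is an assumption here and may need to match the model id your server actually reports (you can check with GET /v1/models if in doubt).

# Sample request against either server; time it (or stream it) to compare
# decode throughput with and without the draft model.
curl -s http://0.0.0.0:5000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "HF://mlc-ai/Llama-3.1-8B-Instruct-q0f16-MLC",
          "messages": [
            {"role": "user", "content": "Explain speculative decoding in two sentences."}
          ],
          "max_tokens": 256,
          "stream": false
        }'

Because the draft model proposes several tokens per step and the target model only verifies them, decode tokens/sec should improve while the output quality remains that of the target model.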

Just FYI @dusty_nv!
