Greetings to all,
Below is a short guide on how to speed up token generation during the decoding phase using speculative decoding. The idea is that a small draft model cheaply proposes several tokens ahead, and the larger target model verifies them in a single pass, accepting the ones that match:
- Target model: Llama-3.1-8B-Instruct-q0f16-MLC
- Draft model: Llama-3.2-1B-Instruct-q4f16_1-MLC
Demo without speculative decoding
mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q0f16-MLC \
--mode server \
--host 0.0.0.0 \
--port 5000
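Once the server is up, you can sanity-check it through the OpenAI-compatible REST API. A minimal sketch, assuming the default /v1/chat/completions endpoint and that the served model id matches the HF path passed to mlc_llm serve (query /v1/models if unsure):

# Assumption: model id equals the HF path used above; adjust if /v1/models reports a different name
curl -s http://0.0.0.0:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "HF://mlc-ai/Llama-3.1-8B-Instruct-q0f16-MLC",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'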
Demo with speculative decoding
mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q0f16-MLC \
--additional-models HF://mlc-ai/Llama-3.2-1B-Instruct-q4f16_1-MLC \
--speculative-mode "small_draft" \
--overrides "max_num_sequence=6" \
--mode server \
--host 0.0.0.0 \
--port 5000
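The max_num_sequence=6 override caps the number of concurrent sequences; a lower cap reduces KV-cache memory, which presumably helps here since both the target and the draft model now need cache space. To get a rough feel for the speedup, you can time the same request against each server configuration. This is only a sanity check, not a proper benchmark; the tokens/second the engine itself reports is the better metric:

# Rough comparison: run this against the baseline server, then the speculative one.
# Assumption: same endpoint and model id as in the request above.
time curl -s http://0.0.0.0:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "HF://mlc-ai/Llama-3.1-8B-Instruct-q0f16-MLC",
    "messages": [{"role": "user", "content": "Write a short story about a robot."}],
    "max_tokens": 256
  }' > /dev/null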
Just FYI, @dusty_nv!