Greetings, everyone.
Below are demo videos comparing vanilla decoding with EAGLE-3 speculative decoding using the SGLang inference engine on the NVIDIA Jetson AGX Orin. The base model is Llama-3.1-8B-Instruct.
1. Using vanilla decoding
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--device cuda \
--dtype bfloat16 \
--mem-fraction-static 0.8
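Once the server is up, it can be queried through SGLang's native /generate endpoint (assuming the default port 30000; the prompt and sampling parameters below are only an example):

curl -s http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "List three benefits of speculative decoding.",
    "sampling_params": {"temperature": 0, "max_new_tokens": 64}
  }'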
Demo video:
2. Using EAGLE-3 speculative decoding
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B \
--speculative-num-steps 5 \
--speculative-eagle-topk 8 \
--speculative-num-draft-tokens 64 \
--cuda-graph-max-bs 1 \
--mem-fraction-static 0.8 \
--dtype float16 \
--port 30000
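The speedup from speculative decoding comes from verifying several draft tokens in one forward pass of the base model. As a rough intuition (EAGLE-3 actually drafts a tree, here depth 5 with top-k 8 and 64 draft tokens, so this simple chain model is only a lower-bound sketch, and the acceptance probability alpha is a hypothetical parameter, not a measured value):

def expected_accepted(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step for a k-token draft
    chain with i.i.d. per-token acceptance probability alpha:
    sum_{i=0..k} alpha^i = (1 - alpha^(k+1)) / (1 - alpha)."""
    return sum(alpha ** i for i in range(k + 1))

# With a 5-token draft chain, higher acceptance rates compound quickly:
for alpha in (0.6, 0.8, 0.9):
    print(f"alpha={alpha}: {expected_accepted(alpha, 5):.2f} tokens/step")

Even at alpha = 0.6 the model emits well over one token per base-model step, which is why the demo below shows a visibly faster stream than vanilla decoding.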
Demo video: