SOTA inference speed using SGLang and EAGLE-3 speculative decoding on the NVIDIA Jetson AGX Orin

Greetings, everyone.

Below are demo videos comparing vanilla decoding with EAGLE-3 speculative decoding using the SGLang inference engine on the NVIDIA Jetson AGX Orin. The base model is Llama-3.1-8B-Instruct.

1. Using vanilla decoding

python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --device cuda \
    --dtype bfloat16 \
    --mem-fraction-static 0.8

Demo video:
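If you want to check the numbers yourself rather than just watch the video, a small client script works against the server. This is a minimal sketch, assuming the server is up on SGLang's default port 30000 and that the requests package is installed; the prompt and max_tokens value are just examples.

import time
import requests

prompt = "Explain speculative decoding in two sentences."  # example prompt

start = time.time()
resp = requests.post(
    # SGLang's OpenAI-compatible chat endpoint, default port 30000
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    },
)
elapsed = time.time() - start

data = resp.json()
tokens = data["usage"]["completion_tokens"]
print(data["choices"][0]["message"]["content"])
print(f"{tokens} tokens in {elapsed:.2f} s -> {tokens / elapsed:.1f} tok/s")

Run it once against this server and once against the EAGLE-3 server below (both listen on port 30000, so start one at a time) to get a tokens-per-second comparison.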

2. Using EAGLE-3 speculative decoding

python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 64 \
    --cuda-graph-max-bs 1 \
    --mem-fraction-static 0.8 \
    --dtype float16 \
    --port 30000

Demo video:
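Speculative decoding is transparent to the client, so the exact same API works here and the timing script above needs no changes. To see the per-token speedup in the terminal the way the demo video shows it, a streaming variant helps. Again a sketch under the same assumptions (server on port 30000, requests installed, example prompt):

import json
import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Write a haiku about the Jetson AGX Orin."}],
        "max_tokens": 128,
        "stream": True,  # request server-sent events instead of one response
    },
    stream=True,
)

# Parse the OpenAI-style SSE stream: each line is "data: {json}" until "data: [DONE]"
for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"].get("content") or ""
    print(delta, end="", flush=True)
print()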


Hi, can you share how you managed to install SGLang on the AGX Orin? I tried the jetson-containers SGLang image, but it still fails.

Thanks.

Hi @rocfatcat, sorry about that. Use the instructions here to install from source: SGLang container in cu128 · Issue #939 · dusty-nv/jetson-containers · GitHub