Speculative decoding using vLLM on the Nvidia Jetson AGX Orin 64GB dev kit

Greetings everyone,

If anyone is interested, below is a command to increase token generation throughput using speculative decoding with vLLM (using the v0 engine: export VLLM_USE_V1=0) on the NVIDIA Jetson AGX Orin 64GB dev kit.

vllm serve \
    meta-llama/Llama-3.1-8B-Instruct \
    --gpu-memory-utilization 0.9 \
    --speculative-model turboderp/Qwama-0.5B-Instruct \
    --use-v2-block-manager \
    --num-speculative-tokens 5 \
    --ngram-prompt-lookup-min 10
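For anyone curious what --num-speculative-tokens 5 actually buys you, here is a toy sketch (not vLLM internals) of the draft-and-verify idea behind speculative decoding: the small draft model proposes up to k tokens per step, and the large target model checks them in a single forward pass, keeping the longest agreeing prefix plus one token of its own. This is the greedy-decoding version; with sampling, vLLM uses a rejection-sampling variant instead of exact matching.

```python
def speculative_step(draft_tokens, target_tokens):
    """Return the tokens accepted from one draft/verify round.

    draft_tokens: the k tokens proposed by the small draft model.
    target_tokens: what the target model would emit at each position
    (here supplied directly, since this is only an illustration).
    """
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            # Target agrees with the draft token: keep it and continue.
            accepted.append(d)
        else:
            # First disagreement: keep the target's token and stop.
            accepted.append(t)
            return accepted
    return accepted  # all k draft tokens accepted

# Best case: one verify pass yields k tokens instead of 1.
print(speculative_step([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # → [1, 2, 3, 4, 5]
# Disagreement at position 3: three tokens still come out of a single pass.
print(speculative_step([1, 2, 9, 4, 5], [1, 2, 3, 4, 5]))  # → [1, 2, 3]
```

The speedup therefore depends on how often the draft model agrees with the target; a larger k helps only when the acceptance rate is high.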

Video demonstration

Without speculative decoding

vllm serve meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.9
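To compare the two configurations yourself, you can hit the OpenAI-compatible endpoint that vllm serve exposes. A minimal client sketch, assuming the server is running on vLLM's default localhost:8000 (adjust if you passed --port); the helper names here are my own, not part of vLLM:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/completions"

def build_payload(prompt: str, max_tokens: int = 128) -> dict:
    """Build a /v1/completions request body for the served model."""
    return {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,  # greedy decoding keeps runs comparable
    }

def complete(prompt: str) -> str:
    """Send one completion request and return the generated text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

if __name__ == "__main__":
    print(complete("Explain speculative decoding in one sentence."))
```

Running the same prompt against the server started with and without the speculative-decoding flags, and timing the responses, shows the throughput difference directly.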

Demo video: