Greetings everyone,
If anyone is interested, below is a command to increase token generation throughput using speculative decoding with vLLM (v0 engine: export VLLM_USE_V1=0) on the NVIDIA Jetson AGX Orin 64GB dev kit.
vllm serve \
meta-llama/Llama-3.1-8B-Instruct \
--gpu_memory_utilization 0.9 \
--speculative-model turboderp/Qwama-0.5B-Instruct \
--use-v2-block-manager \
--num-speculative-tokens 5 \
--ngram-prompt-lookup-min 10
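Once the server is up, you can sanity-check it with a completion request. This is a minimal sketch using vLLM's OpenAI-compatible API (it listens on port 8000 by default); the prompt text is just an illustrative example:

```shell
# Assumes the `vllm serve` command above is already running on this machine.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "Explain speculative decoding in one sentence.",
        "max_tokens": 64
      }'
```

Sending the same request against the baseline server below is an easy way to compare generated tokens per second with and without the draft model.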
Video demonstration:
For comparison, without speculative decoding:
vllm serve meta-llama/Llama-3.1-8B-Instruct --gpu_memory_utilization 0.9
Demo video: