I would like to run this:
with
and/or
Using EUGR’s custom vLLM fork. Has anyone been able to get this to work?
Thanks!
I launch it like this, with the assistant model as the draft model (see the final argument). I don’t know whether the rest of the options are optimal:
./launch-cluster.sh -t vllm-node-tf5 exec \
vllm serve nvidia/Gemma-4-31B-IT-NVFP4 \
--max-model-len auto \
--gpu-memory-utilization 0.8125 \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--reasoning-parser gemma4 \
--load-format fastsafetensors \
--kv-cache-dtype fp8 \
--enable-prefix-caching \
--max-num-batched-tokens 8192 \
--host 0.0.0.0 --port 8000 \
-tp 2 --distributed-executor-backend ray \
--speculative-config '{"model": "google/gemma-4-31B-it-assistant", "num_speculative_tokens": 2}'
You can use this; there is no recipe yet. Don’t forget to pull the latest changes and rebuild the container.
./launch-cluster.sh --solo \
exec vllm serve nvidia/Gemma-4-31B-IT-NVFP4 \
--max-model-len auto \
--gpu-memory-utilization 0.7 \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--reasoning-parser gemma4 \
--load-format fastsafetensors \
--kv-cache-dtype fp8 \
--enable-prefix-caching \
--host 0.0.0.0 \
--port 8888 \
--max-num-batched-tokens 32768 \
--speculative-config '{"model": "google/gemma-4-31B-it-assistant", "num_speculative_tokens": 4}'
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| nvidia/Gemma-4-31B-IT-NVFP4 | pp2048 | 1435.03 ± 70.00 | | 1437.00 ± 72.04 | 1431.35 ± 72.04 | 1437.00 ± 72.04 |
| nvidia/Gemma-4-31B-IT-NVFP4 | tg1024 | 14.10 ± 4.24 | 24.78 ± 3.75 | | | |
| nvidia/Gemma-4-31B-IT-NVFP4 | pp2048 @ d8192 | 1060.67 ± 4.43 | | 9661.22 ± 40.24 | 9655.58 ± 40.24 | 9661.22 ± 40.24 |
| nvidia/Gemma-4-31B-IT-NVFP4 | tg1024 @ d8192 | 12.47 ± 0.77 | 21.40 ± 1.50 | | | |
| nvidia/Gemma-4-31B-IT-NVFP4 | pp2048 @ d32768 | 596.21 ± 0.70 | | 58402.14 ± 69.21 | 58396.49 ± 69.21 | 58402.14 ± 69.21 |
| nvidia/Gemma-4-31B-IT-NVFP4 | tg1024 @ d32768 | 10.15 ± 1.10 | 18.20 ± 2.04 | | | |
llama-benchy (0.3.8.dev2+gff162bcfc)
date: 2026-05-14 18:46:35 | latency mode: api
For the 26B model you will need a different drafter model, google/gemma-4-26B-A4B-it-assistant, but otherwise the launch command will be similar.
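In case it helps, here is a rough sketch of how the drafter swap could be parameterized so the same launch line works for either model. The `spec_config` helper and the `TARGET_MODEL` variable are my own hypothetical names, not part of the fork or of vLLM, and the target model is left as a placeholder because no 26B target checkpoint is named in this thread. The draft count of 4 is the value suggested here for 26B-A4B. The script only prints the command so you can review it before running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the JSON value for --speculative-config
# from a drafter model name and a number of speculative tokens.
spec_config() {
  local draft_model="$1" num_tokens="$2"
  printf '{"model": "%s", "num_speculative_tokens": %d}' "$draft_model" "$num_tokens"
}

# Placeholder (assumption): set TARGET_MODEL to the 26B checkpoint you actually serve.
TARGET_MODEL="${TARGET_MODEL:-<your-26B-target-model>}"
DRAFT_MODEL="google/gemma-4-26B-A4B-it-assistant"

# Print the command instead of executing it; append the remaining
# options from the 31B launch command as needed.
echo ./launch-cluster.sh --solo exec vllm serve "$TARGET_MODEL" \
  --speculative-config "'$(spec_config "$DRAFT_MODEL" 4)'"
```

Building the JSON with `printf` keeps the quotes plain ASCII, which matters because curly quotes pasted from a forum post will not parse as JSON.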
The general consensus (and my testing agrees) is that drafting 7 tokens with MTP is about optimal for the 31B model, and 4 is optimal for 26B-A4B. Both are really good drafters.
I tried the recommended MTP values for both the dense and MoE models, but I’m seeing a lot of tool-calling errors. Any tips? Thanks!