How to run the gemma4-assistant models using EUGR's custom vLLM fork?

I would like to run this:

with

and/or

Using EUGR’s custom vLLM fork. Has anyone been able to get this to work?

Thanks!

I launch it like this, with the assistant as the draft model (see the final argument). I don’t know if the rest of the options are optimal:

./launch-cluster.sh -t vllm-node-tf5 exec \
vllm serve nvidia/Gemma-4-31B-IT-NVFP4 \
  --max-model-len auto \
  --gpu-memory-utilization 0.8125 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --load-format fastsafetensors \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --max-num-batched-tokens 8192 \
  --host 0.0.0.0  --port 8000 \
  -tp 2 --distributed-executor-backend ray \
  --speculative-config '{"model": "google/gemma-4-31B-it-assistant", "num_speculative_tokens": 2}'

You can use this; there is no recipe yet. Don’t forget to pull the latest changes and rebuild the container.

./launch-cluster.sh --solo \
 exec vllm serve nvidia/Gemma-4-31B-IT-NVFP4 \
  --max-model-len auto \
  --gpu-memory-utilization 0.7 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --load-format fastsafetensors \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8888 \
  --max-num-batched-tokens 32768 \
  --speculative-config '{"model": "google/gemma-4-31B-it-assistant", "num_speculative_tokens": 4}'

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| nvidia/Gemma-4-31B-IT-NVFP4 | pp2048 | 1435.03 ± 70.00 | | 1437.00 ± 72.04 | 1431.35 ± 72.04 | 1437.00 ± 72.04 |
| nvidia/Gemma-4-31B-IT-NVFP4 | tg1024 | 14.10 ± 4.24 | 24.78 ± 3.75 | | | |
| nvidia/Gemma-4-31B-IT-NVFP4 | pp2048 @ d8192 | 1060.67 ± 4.43 | | 9661.22 ± 40.24 | 9655.58 ± 40.24 | 9661.22 ± 40.24 |
| nvidia/Gemma-4-31B-IT-NVFP4 | tg1024 @ d8192 | 12.47 ± 0.77 | 21.40 ± 1.50 | | | |
| nvidia/Gemma-4-31B-IT-NVFP4 | pp2048 @ d32768 | 596.21 ± 0.70 | | 58402.14 ± 69.21 | 58396.49 ± 69.21 | 58402.14 ± 69.21 |
| nvidia/Gemma-4-31B-IT-NVFP4 | tg1024 @ d32768 | 10.15 ± 1.10 | 18.20 ± 2.04 | | | |

llama-benchy (0.3.8.dev2+gff162bcfc)
date: 2026-05-14 18:46:35 | latency mode: api
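
For anyone reading the prompt-processing rows: the est_ppt values look consistent with (context depth + prompt tokens) divided by the reported t/s. That relationship is my inference from the numbers above, not something taken from llama-benchy's documentation, and `est_ms` is a hypothetical helper name. A rough sketch of the arithmetic:

```shell
# est_ms TOKENS TOKENS_PER_SEC
# Prints the estimated processing time in milliseconds, rounded.
# Assumption: time = tokens / throughput (inferred from the table, not documented).
est_ms() {
  awk -v n="$1" -v r="$2" 'BEGIN { printf "%.0f\n", n / r * 1000 }'
}

# pp2048 @ d8192: 8192 tokens of context plus a 2048-token prompt at 1060.67 t/s
est_ms $((2048 + 8192)) 1060.67

# pp2048 @ d32768: 32768 tokens of context plus a 2048-token prompt at 596.21 t/s
est_ms $((2048 + 32768)) 596.21
```

The first call gives about 9654 ms, close to the 9655.58 ms reported for the d8192 row; small differences are expected because the benchmark averages over several runs.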

For the 26B model you will need a different drafter, google/gemma-4-26B-A4B-it-assistant; otherwise the launch command will be similar.

The general consensus (and my testing agrees) is that drafting 7 tokens per step (MTP=7) is about optimal for the 31B and MTP=4 is about optimal for the 26B-A4B. Both are really good drafters.
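
If you switch between the two drafters often, something like the following may help keep the `--speculative-config` value well-formed. It is only a sketch: `spec_config` is a hypothetical helper, and it assumes the "MTP" numbers above map to the `num_speculative_tokens` field used in the launch commands earlier in the thread.

```shell
# Hypothetical helper: prints a JSON value for --speculative-config.
#   $1 = draft model name
#   $2 = number of tokens to draft per step
spec_config() {
  printf '{"model": "%s", "num_speculative_tokens": %d}' "$1" "$2"
}

# Values suggested in this thread (assumed mapping: MTP=N -> num_speculative_tokens=N)
SPEC_31B=$(spec_config google/gemma-4-31B-it-assistant 7)
SPEC_26B=$(spec_config google/gemma-4-26B-A4B-it-assistant 4)

echo "$SPEC_31B"
echo "$SPEC_26B"
```

You would then pass it quoted, e.g. `--speculative-config "$SPEC_31B"`, in place of the literal JSON. Building the string this way also avoids pasting typographic quotes from a forum post into the shell.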

I tried the recommended MTP values for both the dense and MoE models, but I’m seeing a lot of errors in tool calling. Any tips? Thanks!