Hey guys and gals. Hopping in, not sure how I can help because I am new to this. But I have a cluster of 8 Sparks and willing to help.
So please excuse me if I don’t get all of the wording right.
Hey guys and gals. Hopping in, not sure how I can help because I am new to this. But I have a cluster of 8 Sparks and willing to help.
So please excuse me if I don’t get all of the wording right.
the flexing is hard :p
Awwwee man, I don’t mean to flex. I would just love to give back somehow and am somewhat intimidated on how to help the community. TBH, I don’t know much about or know the proper venacular, but I have them and willing to put them to work.
Confirming gemma-4-31b-it loads and serves on Spark via vllm/vllm-openai:gemma4-cu130.
Pulled the image clean on ARM64. Architecture resolved natively to Gemma4ForConditionalGeneration… no Transformers fallback. TRITON_ATTN forced automatically for the heterogeneous head dims. Running BF16, enforce-eager, 32K context, 0.85 gpu-memory-utilization.
Model weights downloading now. Will post speed numbers once it’s serving.
Launch command for anyone else testing:
sudo docker run -d
–name kasari-gemma4-test
–gpus all
–no-healthcheck
-p 8000:8000
-v /root/.cache/huggingface:/root/.cache/huggingface
-e HF_TOKEN=your-token
vllm/vllm-openai:gemma4-cu130
–model google/gemma-4-31b-it
–host 0.0.0.0
–port 8000
–max-model-len 32768
–gpu-memory-utilization 0.85
–enforce-eager
–trust-remote-code
Waiting on the NVFP4 checkpoint to see where the real performance lands.
(Claude helped me write this post lol)
Since you have a stack you can start by looking at the community repo maintain by @eugr .
You will find info and recipe for 4 spark cluster stack. I guess 4 or 8 is kinda same.
Then you can either fit big ass model and benchmark them with llama-benchy for inference performance or try build a diverse stack on well choosen quantized model. dgx spark have smaller bandwith than GPU gaming but they are able to be steady on volume and have big vram and chip to chip interface. It seems to me running quantize model on spark is the best thing to do.
Thanks for that.
When did you build the image? I’ve just published new wheels that should support this model an hour ago or so.
Also, you need --tf5 build with Transformers v5.
I’m about to test myself - haven’t had a chance yet, downloading the MoE model now.
docker run -it --gpus all -p 8000:8000 --ipc=host \
--ulimit memlock=-1 --ulimit stack=67108864 \
-v /home/ptrck/models:/models \
--entrypoint bash \
vllm/vllm-openai:gemma4 -c '
PARSER=/usr/local/lib/python3.12/dist-packages/vllm/tool_parsers/gemma4_tool_parser.py &&
sed -i "s/from vllm.tool_parsers.abstract_tool_parser import ToolParser/from vllm.tool_parsers.abstract_tool_parser import Tool, ToolParser/"
$PARSER &&
sed -i "s/def __init__(self, tokenizer: TokenizerLike):/def __init__(self, tokenizer: TokenizerLike, tools: list[Tool] | None = None):/"
$PARSER &&
sed -i "s/super().__init__(tokenizer)/super().__init__(tokenizer, tools)/" $PARSER &&
python3 -m vllm.entrypoints.openai.api_server \
--model /models/gemma-4-26B-A4B-it \
--trust-remote-code \
--port 8000 \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--reasoning-parser gemma4 \
--chat-template /models/gemma-4-26B-A4B-it/chat_template.jinja'
Got it working with this command, can drop the manual patching once we get a container with that PR that got merged.
spark-vllm-docker can also run BF16 model:
./build-and-copy.sh -t vllm-node-tf5 --tf5
./launch-cluster.sh -t vllm-node-tf5 --solo \
exec vllm serve google/gemma-4-26B-A4B-it \
--max-model-len auto \
--gpu-memory-utilization 0.7 \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--reasoning-parser gemma4 \
--load-format fastsafetensors
But of course FP8 model would be better, as BF16 performance is not optimal:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| google/gemma-4-26B-A4B-it | pp2048 | 3823.10 ± 368.77 | 547.80 ± 53.75 | 541.04 ± 53.75 | 547.95 ± 53.76 | |
| google/gemma-4-26B-A4B-it | tg32 | 24.04 ± 0.73 | 24.67 ± 0.94 | |||
| google/gemma-4-26B-A4B-it | pp2048 @ d8192 | 3352.34 ± 173.08 | 3290.74 ± 174.12 | 3283.97 ± 174.12 | 3290.94 ± 174.22 | |
| google/gemma-4-26B-A4B-it | tg32 @ d8192 | 22.97 ± 0.03 | 23.33 ± 0.47 | |||
| google/gemma-4-26B-A4B-it | pp2048 @ d16384 | 3113.20 ± 12.19 | 6388.67 ± 18.05 | 6381.91 ± 18.05 | 6388.82 ± 18.03 | |
| google/gemma-4-26B-A4B-it | tg32 @ d16384 | 22.26 ± 0.05 | 23.00 ± 0.00 | |||
| google/gemma-4-26B-A4B-it | pp2048 @ d32768 | 2488.89 ± 3.95 | 15157.26 ± 28.06 | 15150.50 ± 28.06 | 15157.37 ± 28.01 | |
| google/gemma-4-26B-A4B-it | tg32 @ d32768 | 21.67 ± 0.08 | 22.33 ± 0.47 |
llama-benchy (0.3.5)
date: 2026-04-02 16:43:21 | latency mode: api
most probably my build was few minutes before the patch was applied :) I rebuild right now and it works perfectly now. I can tell you it is pretty fast with the 26b version, I was able to use multitool agents and image description. Looks like it is a very solid model. the 31b nvfp4 is painfully slow but yeah. for the 26b I am using this recipe:
recipe_version: "1"
name: google-gemma4-26b
description: vLLM serving google-gemma4-26b
# HuggingFace model to download (optional, for --download-model)
model: google/gemma-4-26B-A4B-it
# Container image to use
container: vllm-node-tf5
# This model can only run on single node (solo)
solo_only: false
# No mods required
#mods:
# Default settings (can be overridden via CLI)
defaults:
port: 8000
host: 0.0.0.0
tensor_parallel: 1
gpu_memory_utilization: 0.8
max_model_len: 262144
# The vLLM serve command template
command: |
vllm serve google/gemma-4-26B-A4B-it \
--max-model-len {max_model_len} \
--port {port} --host {host} \
--trust-remote-code \
--enable-auto-tool-choice \
--kv-cache-dtype fp8 \
--enable-prefix-caching \
--load-format fastsafetensors \
--tool-call-parser gemma4 \
--reasoning-parser gemma4 \
--gpu-memory-utilization {gpu_memory_utilization}
I’m now trying to make it run with FP8 online quantization as the only FP8 quant on HF is “weird” - it has BF16 size and doesn’t even load properly.
Is this protoLabsAI/gemma-4-26B-A4B-it-FP8?
If so, I won’t bother downloading it with my painfully slow connection.
Yeah, that’s the one
I think I’ll need to merge one pending PR for that - building with it now.
I’ve got FP8 on-the-fly quant working, but it needs a VLLM PR 35568 - I’m going to include it into my build and launch the build pipeline again, so hopefully it will be available soon, otherwise you can do it now by:
./build-and-copy.sh -t vllm-node-20260402-tf5-pr35568 --tf5 --apply-vllm-pr 35568
./launch-cluster.sh -t vllm-node-20260402-tf5-pr35568 --solo \
exec vllm serve google/gemma-4-26B-A4B-it \
--max-model-len auto \
--gpu-memory-utilization 0.7 \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--reasoning-parser gemma4 \
--load-format fastsafetensors \
--quantization fp8 \
--kv-cache-dtype fp8
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| google/gemma-4-26B-A4B-it | pp2048 | 5389.77 ± 349.88 | 388.93 ± 25.88 | 381.78 ± 25.88 | 389.07 ± 25.89 | |
| google/gemma-4-26B-A4B-it | tg32 | 38.26 ± 0.06 | 39.50 ± 0.06 | |||
| google/gemma-4-26B-A4B-it | pp2048 @ d8192 | 4325.76 ± 383.07 | 2563.59 ± 247.82 | 2556.45 ± 247.82 | 2563.74 ± 247.79 | |
| google/gemma-4-26B-A4B-it | tg32 @ d8192 | 37.40 ± 0.11 | 38.62 ± 0.11 | |||
| google/gemma-4-26B-A4B-it | pp2048 @ d16384 | 3599.34 ± 3.18 | 5507.33 ± 17.16 | 5500.18 ± 17.16 | 5507.46 ± 17.16 | |
| google/gemma-4-26B-A4B-it | tg32 @ d16384 | 36.49 ± 0.19 | 37.68 ± 0.19 | |||
| google/gemma-4-26B-A4B-it | pp2048 @ d32768 | 2359.54 ± 0.40 | 15965.33 ± 27.40 | 15958.19 ± 27.40 | 15965.40 ± 27.41 | |
| google/gemma-4-26B-A4B-it | tg32 @ d32768 | 36.01 ± 0.06 | 37.18 ± 0.06 |
llama-benchy (0.3.5)
date: 2026-04-02 17:10:09 | latency mode: api
thanks, trying this now, will post how things go.
A ~36 tps seems descent if the model itself remains capable as a 26B-MoE, will do some experimentation to see how it compares to nemo-30B.
Is this the 4bit model that didn’t perform well on the spark?
Here is my result for the dense model in a dual node config:
vllm serve cyankiwi/gemma-4-31B-it-AWQ-4bit \
--host 0.0.0.0 \
--port 8080 \
--max-model-len auto \
--max-num-seqs 4 \
--gpu-memory-utilization 0.75 \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--reasoning-parser gemma4 \
--load-format fastsafetensors \
--kv-cache-dtype fp8 \
--tensor-parallel-size 2 \
--served-model-name gemma4 \
--distributed-executor-backend ray
Benchmark via llama-benchy --base-url http://0.0.0.0:8080/v1 --model gemma4 --latency-mode api --pp 2048 --tg 128 --depth 0 4096 8192 16384 --concurrency 1:
llama-benchy (0.3.5)
Date: 2026-04-03 05:22:59
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:--------|----------------:|----------------:|-------------:|-----------------:|-----------------:|-----------------:|
| gemma4 | pp2048 | 1509.32 ± 28.16 | | 1249.71 ± 57.50 | 1248.16 ± 57.50 | 1249.77 ± 57.49 |
| gemma4 | tg128 | 18.70 ± 0.02 | 19.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d4096 | 1445.17 ± 9.64 | | 4081.66 ± 38.36 | 4080.12 ± 38.36 | 4081.72 ± 38.37 |
| gemma4 | tg128 @ d4096 | 18.43 ± 0.04 | 19.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d8192 | 1366.90 ± 0.06 | | 7264.20 ± 23.99 | 7262.66 ± 23.99 | 7264.28 ± 23.98 |
| gemma4 | tg128 @ d8192 | 18.12 ± 0.04 | 19.00 ± 0.00 | | | |
| gemma4 | pp2048 @ d16384 | 1190.44 ± 1.30 | | 15018.70 ± 12.20 | 15017.16 ± 12.20 | 15018.76 ± 12.20 |
| gemma4 | tg128 @ d16384 | 17.71 ± 0.05 | 18.00 ± 0.00 | | | |
llama-benchy (0.3.5)
date: 2026-04-03 05:22:59 | latency mode: api
Looks like I’m getting around 39 tps max using the FP8 version as shown above. Nice work @eugr , thank you!
I’m going to see if this model “likes lobsters” and see what that looks like :-)
I’ll share my findings. I typically run it through liteLLM to get some better visibility (I’m not running the hacked version… sigh…)
Where I tend to run into trouble is around the max context-window size. My understanding is that this mode l maxes out at 256k context size, which is actually pretty large. Performance will likely be terrible at that size, so I’m trying to do some heavy llama-benchy runs to find the sweet spot.
Not exactly sure how to interpret that output – details to come.
The new build is now on github, so no need to compile from source anymore, just do the usual:
./build-and-copy.sh -t vllm-node-tf5 --tf5
(add -c --copy-parallel if you have a cluster).
I’ll add a recipe soon.
Using the latest spark-vllm-docker build serving nvidia/Gemma-4-31B-IT-NVFP4 · Hugging Face with:
VLLM_SPARK_EXTRA_DOCKER_ARGS="-v /home/ross/models:/home/ross/models" ./launch-cluster.sh -t vllm-node-tf5 --solo -e OMP_NUM_THREADS=4 -e PYTORCH_ALLOC_CONF=expandable_segments:True -e VLLM_WORKER_MULTIPROC_METHOD=spawn -e SAFETENSORS_FAST_GPU=1 -e VLLM_DISABLE_PYNCCL=1 -e NCCL_IB_DISABLE=1 -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 -e VLLM_NVFP4_GEMM_BACKEND=cutlass -e VLLM_USE_FLASHINFER_MOE_FP4=0 exec vllm serve /home/ross/models/Gemma-4-31B-IT-NVFP4 --quantization compressed-tensors --trust-remote-code --enable-prefix-caching --kv-cache-dtype fp8 --max-model-len auto --port 8000 --host 0.0.0.0 --language-model-only --max-num-seqs 4 --enable-auto-tool-choice --reasoning-parser gemma4 --tool-call-parser gemma4 --enable-chunked-prefill --gpu-memory-utilization 0.7 --max_num_batched_tokens 8192 --served-model-name Gemma-4-31B-IT-NVFP4
I got following results
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:---------------------|----------------:|---------------:|------------:|------------------:|------------------:|------------------:|
| Gemma-4-31B-IT-NVFP4 | pp2048 | 634.98 ± 31.14 | | 2952.82 ± 167.67 | 2950.33 ± 167.67 | 2952.89 ± 167.68 |
| Gemma-4-31B-IT-NVFP4 | tg128 | 3.43 ± 0.01 | 4.00 ± 0.00 | | | |
| Gemma-4-31B-IT-NVFP4 | pp2048 @ d4096 | 552.84 ± 1.03 | | 10705.51 ± 115.81 | 10703.02 ± 115.81 | 10705.57 ± 115.81 |
| Gemma-4-31B-IT-NVFP4 | tg128 @ d4096 | 3.40 ± 0.00 | 4.00 ± 0.00 | | | |
| Gemma-4-31B-IT-NVFP4 | pp2048 @ d8192 | 464.59 ± 2.82 | | 21507.66 ± 128.51 | 21505.17 ± 128.51 | 21507.73 ± 128.50 |
| Gemma-4-31B-IT-NVFP4 | tg128 @ d8192 | 3.34 ± 0.04 | 4.00 ± 0.00 | | | |
| Gemma-4-31B-IT-NVFP4 | pp2048 @ d16384 | 358.82 ± 0.64 | | 50016.53 ± 226.40 | 50014.05 ± 226.40 | 50016.59 ± 226.40 |
| Gemma-4-31B-IT-NVFP4 | tg128 @ d16384 | 3.31 ± 0.00 | 4.00 ± 0.00 | | | |