Hello all new to the gb10 community. I now have two. What optimization and bug fixes have you done to get inference speed up? Right now on a q8 Gemma 4 load can’t get over 5t/s. Any help would be greatly appreciated. Not so much about the model but how do you get usable generation out of these. 5 is unusable honestly and it’s not even a big model.
It is a very big model, because it has 31 billion active parameters. Gemma 4 26B A4B only uses 4 billion parameters per token, and GPT OSS 120B only uses 5 billion parameters per token. Both of these models are substantially faster for that reason.
Activating 31 billion parameters per token is going to be very slow on Spark, no matter what, but it is pretty unoptimized at the moment. Since memory bandwidth is the hard limit, 273 GB/s divided by the size of the active parameters provides the theoretical limit that you can never hit. At Q8_0, 273GB/s / 31GB = 8.8 tokens/s theoretical maximum. At Q4_0, it would be about 17 tokens per second. Right now, I’m seeing about 6 tokens per second on Q8_0, so things could get faster with better software optimizations.
A dense model like Gemma 4 31B is specifically designed to maximize the intelligence that can fit on an RTX 5090, which has insane memory bandwidth, but not much memory capacity. It is not designed to be suitable for Spark, which has lots of memory capacity, but not much bandwidth.
If you link your two Sparks together and use vLLM, you can use tensor parallelism to get twice the effective memory bandwidth and twice the speed, but Gemma 4 31B is still not a good fit for Spark.
I agree. I’ve been playing around with optimization and it seems like there is a lot of bugs any which way you look at it. I think this device sits in a really weird spot where it just doesn’t have enough bandwidth to have this amount of capacity they can say that it’s supposed to be for testing and whatever the case may be but in reality if you can’t get at least 30 to 60 tokens per second inference with a 30 GB model. It’s obviously underpowered.
Agree with there are lots of bugs..
I believe most of us known the bandwidth and tokens/sec issues before purchase…
For me, Spark is more like a development and validation platform and llm is an additional tools.
Yes, it should be better with this price tag, I hope it can be at least 500GB/s+ for bandwidth (which Apple provided in this price range).
For gemma 4 31b
you can try QuantTrio/gemma-4-31B-it-AWQ, I get around 10 tokens/s during query.
Still not fast, but I think it is usable for single user in open web-ui (haven’t test it on ai agent yet)
As other mentioned, the speed of model is highly depended on bandwidth and # of active parameters, the size of models doesn’t really directly related to the performance.
I will have to try that! It’s insane to be they couldn’t get the bandwidth to even like 700-800. I’m not the biggest apple fan but if they can get 1000+ and 512gb in the m5 ultra I might dump these. They either need that or some type of bandwidth scaling optimization or trick or something? I wonder if my agents use the model with turboquant if it will be faster they claim 5-8X.
Any alternative models to Gemma4 for coding? I’m in China and tried many including Minimax and Deepseek coder V2 and they just are not cutting it. I need deep reasoning for neurosymbolic AI development and am blocked from using Opus4.6…
31B is the dense model
26B A4B will feel much faster, though slightly less accurate/intelligent.
I would recommend Qwen3.5 122B A10B autoround instead. Definitely the best right now on a Spark, unless if you need the audio ability from Gemma.
But you can get 50 to 60 tokens per second with a 60GB model… twice the size you mentioned. (GPT-OSS-120B.)
It’s only underpowered if you’re trying to fit a square peg into a round hole. The model architecture and the hardware need to be suitable for each other.
Use an NVFP4 quant (basically doubles throughput vs. FP8).
On dual spark, I got about 17t/s TG with gemma4 31B
That’s about the realm of usable. However, I mostly run qwen3.5 397B int4 on my dual sparks.
no chance
A dense model like that requires a ton of memory bandwidth. Hi bandwidth memory like found in GPUs is, as you might be aware, ridiculously expensive. Where the DGX Spark excels is running MoE models with large parameter counts that fully utilize the generous unified memory, but with smaller active parameters to accommodate bandwidth limitations. The other huge gain with the Spark is strong prefill, which often is even more critical that token generation speed.
GPT-OSS120b
Qwen 3.5 122b
Qwen 3.5 35b
are all much better fits for DGX Spark.
MiniMax is probably the strongest out of what you mentioned for coding.
I have one of the new MacBook M5 Pros. The bandwidth is great, but prompt processing is still quite slow relative to the DGX
Had a go at trying Gemma 31B, since their QAT release and because it doesnt overthink and is more succinct in its output its actually faster than both Qwen 3.6 27B and 35B for me. Tool eval bench completes in 200 seconds, while with Qwen 3.6 35B takes 400s and 27B 750s for me.
Overall its pretty decent and definitely more token efficient than Qwen.
Recipe below:
recipe_version: "1"
name: Gemma-4-31B-IT-NVFP4-DGX-Spark-NCCL
description: vLLM serving Gemma-4-31B-IT-NVFP4 with DGX Spark NCCL/RoCE backend tuning
# HuggingFace model to download (optional, for --download-model)
model: melcheikh/gemma-4-31B-it-qat-NVFP4-Blackwell
# This variant is intended for a DGX Spark cluster / no-Ray multi-node launch.
cluster_only: true
solo_only: false
# Container image to use
container: vllm-node-tf5
build_args:
- --tf5
# Mods
mods:
- mods/fix-gemma4-chat-template
# Default settings (can be overridden via CLI)
defaults:
port: 8000
host: 0.0.0.0
tensor_parallel: 2
gpu_memory_utilization: 0.8
max_model_len: 262144
max_num_batched_tokens: 32768
# Environment variables
env:
VLLM_FLOAT32_MATMUL_PRECISION: high
TORCH_MATMUL_PRECISION: high
PYTORCH_CUDA_ALLOC_CONF: expandable_segments:True
VLLM_USE_FLASHINFER_SAMPLER: "1"
TORCH_CUDA_ARCH_LIST: "12.1a"
FLASHINFER_CUDA_ARCH_LIST: "12.1a"
NCCL_NET: IB
NCCL_IB_DISABLE: "0"
# Keep these overridable by launcher-provided env, but default to the compose
# values. Adjust for your nodes with run-recipe.py --eth-if/--ib-if or -e.
NCCL_IB_HCA: "${NCCL_IB_HCA:-rocep1s0f1,roceP2p1s0f1}"
NCCL_SOCKET_IFNAME: "${NCCL_SOCKET_IFNAME:-enP7s7,enp1s0f1np1}"
NCCL_IB_GID_INDEX: "${NCCL_IB_GID_INDEX:-0}"
NCCL_CROSS_NIC: "1"
NCCL_CUMEM_ENABLE: "0"
NCCL_IGNORE_CPU_AFFINITY: "1"
NCCL_IB_SUBNET_AWARE_ROUTING: "1"
NCCL_DEBUG: WARN
HF_TOKEN: <hf_token>
# The vLLM serve command template
command: |
vllm serve melcheikh/gemma-4-31B-it-qat-NVFP4-Blackwell \
--max-model-len {max_model_len} \
--hf-overrides.text_config '{{"use_bidirectional_attention":null}}' \
--speculative-config '{{"method": "dflash", "model": "z-lab/gemma-4-31B-it-DFlash", "num_speculative_tokens": 5, "attention_backend": "flash_attn"}}' \
--quantization modelopt \
--enable-flashinfer-autotune \
--enable-prefix-caching \
--gpu-memory-utilization {gpu_memory_utilization} \
--override-generation-config '{{"temperature": 1, "top_p": 0.95, "top_k": 64}}' \
--chat-template fixed_chat_template.jinja \
--default-chat-template-kwargs '{{"preserve_thinking": true}}' \
--port {port} \
--host {host} \
--kv-cache-dtype bfloat16 \
--enable-chunked-prefill \
--max-num-seqs 4 \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--reasoning-parser gemma4 \
--load-format instanttensor \
--language-model-only \
--max-num-batched-tokens {max_num_batched_tokens} \
-tp {tensor_parallel} \
--distributed-executor-backend mp
I’ve run a bunch of the Gemma4 variants through some benchmarks with InspectAI and posted the results here:
It includes the total run time for the tests. I don’t think measuring t/s is a great comparison because some sometimes the faster t/s models aren’t as smart and will waste tokens. For example Gemma4 26B-A4B used 500k input tokens compared to Gemma4 31B using 35k.
qat-w4a16-ct is unsurprisingly fastest, but also (unsurprisingly) scored worst (although not by a large margin in these benchmarks). I haven’t tried to use them for anything useful yet, so I don’t know how that trade-off feels in actual use.
the spark is a piece painted gold ■■■■
my macbook pro m4 max leaves it so far in the dust it is unreal, literally
It does? What speeds do you get on it?
I agree I dumped them and bought a gaudi 2 server lol
Qwen 3.5 122b - nothing else get close on 1 spark, and its is in top 2-3 on dual
Two sparks cost 7-8K. There are no MacBooks with 256G ram. Last time I checked MacBook M5 Max 16 with 128GB and 4TB NVmE it was around 6-6.5K USD. MacStudio with 256 will be more than 8K AFAIK.
If you don’t develop or care for CUDA and just need inference - either Mac or Spark will never pay off. Got API.
If you need to max out speed and RAM on consumer level - MacStudio M3 Ultra 512GB. 20 000 usd or something? Go big or go home I guess
But let’s calculate: 512GB = 4x Sparks at 4K + 4 cables at 100 + router at 1500 = ~18K USD
And 4 sparks in TP=4 will annihilate speed of a single M3 Ultra GPU. Take your poison.
PS and for the argument that dense model does not scale with tensor parallelism - not true. For the science I did run Qwen 3.6 27b on 2 spark cluster with vllm-distributed. Token generation went up to 30 tok/s from 21-22 tok/s. 40% up. So it works. But for MoE it works better - from 60 to 70% speed up. Almost nobody makes dense models anymore - inference too expensive. As above someone correctly said - the model is made for a very specific niche of gaming GPU with very fast cores and memory but very constrained on GB, e.g. 5090. Running it on GPU with vast GB but slower throughput is like fitting a square nut in a round hole and crymeariver afterwards.




