Slow inference with 31b model Gemma 4? Optimizations?

Liquidlava1990 · April 8, 2026, 6:58pm

Hello all new to the gb10 community. I now have two. What optimization and bug fixes have you done to get inference speed up? Right now on a q8 Gemma 4 load can’t get over 5t/s. Any help would be greatly appreciated. Not so much about the model but how do you get usable generation out of these. 5 is unusable honestly and it’s not even a big model.

coder543 · April 8, 2026, 7:28pm

It is a very big model, because it has 31 billion active parameters. Gemma 4 26B A4B only uses 4 billion parameters per token, and GPT OSS 120B only uses 5 billion parameters per token. Both of these models are substantially faster for that reason.

Activating 31 billion parameters per token is going to be very slow on Spark, no matter what, but it is pretty unoptimized at the moment. Since memory bandwidth is the hard limit, 273 GB/s divided by the size of the active parameters provides the theoretical limit that you can never hit. At Q8_0, 273GB/s / 31GB = 8.8 tokens/s theoretical maximum. At Q4_0, it would be about 17 tokens per second. Right now, I’m seeing about 6 tokens per second on Q8_0, so things could get faster with better software optimizations.

A dense model like Gemma 4 31B is specifically designed to maximize the intelligence that can fit on an RTX 5090, which has insane memory bandwidth, but not much memory capacity. It is not designed to be suitable for Spark, which has lots of memory capacity, but not much bandwidth.

If you link your two Sparks together and use vLLM, you can use tensor parallelism to get twice the effective memory bandwidth and twice the speed, but Gemma 4 31B is still not a good fit for Spark.

Liquidlava1990 · April 8, 2026, 10:19pm

I agree. I’ve been playing around with optimization and it seems like there is a lot of bugs any which way you look at it. I think this device sits in a really weird spot where it just doesn’t have enough bandwidth to have this amount of capacity they can say that it’s supposed to be for testing and whatever the case may be but in reality if you can’t get at least 30 to 60 tokens per second inference with a 30 GB model. It’s obviously underpowered.

kenny8379 · April 9, 2026, 3:18am

Agree with there are lots of bugs..

I believe most of us known the bandwidth and tokens/sec issues before purchase…

For me, Spark is more like a development and validation platform and llm is an additional tools.
Yes, it should be better with this price tag, I hope it can be at least 500GB/s+ for bandwidth (which Apple provided in this price range).

For gemma 4 31b
you can try QuantTrio/gemma-4-31B-it-AWQ, I get around 10 tokens/s during query.
Still not fast, but I think it is usable for single user in open web-ui (haven’t test it on ai agent yet)

As other mentioned, the speed of model is highly depended on bandwidth and # of active parameters, the size of models doesn’t really directly related to the performance.

Liquidlava1990 · April 9, 2026, 4:10am

I will have to try that! It’s insane to be they couldn’t get the bandwidth to even like 700-800. I’m not the biggest apple fan but if they can get 1000+ and 512gb in the m5 ultra I might dump these. They either need that or some type of bandwidth scaling optimization or trick or something? I wonder if my agents use the model with turboquant if it will be faster they claim 5-8X.

hankh95 · April 9, 2026, 6:31am

Any alternative models to Gemma4 for coding? I’m in China and tried many including Minimax and Deepseek coder V2 and they just are not cutting it. I need deep reasoning for neurosymbolic AI development and am blocked from using Opus4.6…

notmy.reward438 · April 9, 2026, 12:28pm

31B is the dense model
26B A4B will feel much faster, though slightly less accurate/intelligent.

I would recommend Qwen3.5 122B A10B autoround instead. Definitely the best right now on a Spark, unless if you need the audio ability from Gemma.

coder543 · April 9, 2026, 3:20pm

But you can get 50 to 60 tokens per second with a 60GB model… twice the size you mentioned. (GPT-OSS-120B.)

It’s only underpowered if you’re trying to fit a square peg into a round hole. The model architecture and the hardware need to be suitable for each other.

pfnguyen · April 9, 2026, 8:09pm

Use an NVFP4 quant (basically doubles throughput vs. FP8).

On dual spark, I got about 17t/s TG with gemma4 31B

That’s about the realm of usable. However, I mostly run qwen3.5 397B int4 on my dual sparks.

parad8x010 · April 9, 2026, 9:04pm

no chance

josephbreda · April 9, 2026, 9:28pm

A dense model like that requires a ton of memory bandwidth. Hi bandwidth memory like found in GPUs is, as you might be aware, ridiculously expensive. Where the DGX Spark excels is running MoE models with large parameter counts that fully utilize the generous unified memory, but with smaller active parameters to accommodate bandwidth limitations. The other huge gain with the Spark is strong prefill, which often is even more critical that token generation speed.

GPT-OSS120b
Qwen 3.5 122b
Qwen 3.5 35b

are all much better fits for DGX Spark.

josephbreda · April 9, 2026, 9:29pm

MiniMax is probably the strongest out of what you mentioned for coding.

josephbreda · April 9, 2026, 9:30pm

I have one of the new MacBook M5 Pros. The bandwidth is great, but prompt processing is still quite slow relative to the DGX

arctic.gus · June 9, 2026, 5:18pm

Had a go at trying Gemma 31B, since their QAT release and because it doesnt overthink and is more succinct in its output its actually faster than both Qwen 3.6 27B and 35B for me. Tool eval bench completes in 200 seconds, while with Qwen 3.6 35B takes 400s and 27B 750s for me.
Overall its pretty decent and definitely more token efficient than Qwen.

Recipe below:

recipe_version: "1"

name: Gemma-4-31B-IT-NVFP4-DGX-Spark-NCCL

description: vLLM serving Gemma-4-31B-IT-NVFP4 with DGX Spark NCCL/RoCE backend tuning




# HuggingFace model to download (optional, for --download-model)

model: melcheikh/gemma-4-31B-it-qat-NVFP4-Blackwell




# This variant is intended for a DGX Spark cluster / no-Ray multi-node launch.

cluster_only: true

solo_only: false




# Container image to use

container: vllm-node-tf5




build_args:

  - --tf5




# Mods

mods: 

 - mods/fix-gemma4-chat-template




# Default settings (can be overridden via CLI)

defaults:

  port: 8000

  host: 0.0.0.0

  tensor_parallel: 2

  gpu_memory_utilization: 0.8

  max_model_len: 262144

  max_num_batched_tokens: 32768




# Environment variables

env:

  VLLM_FLOAT32_MATMUL_PRECISION: high

  TORCH_MATMUL_PRECISION: high

  PYTORCH_CUDA_ALLOC_CONF: expandable_segments:True

  VLLM_USE_FLASHINFER_SAMPLER: "1"

  TORCH_CUDA_ARCH_LIST: "12.1a"

  FLASHINFER_CUDA_ARCH_LIST: "12.1a"

  NCCL_NET: IB

  NCCL_IB_DISABLE: "0"

  # Keep these overridable by launcher-provided env, but default to the compose

  # values. Adjust for your nodes with run-recipe.py --eth-if/--ib-if or -e.

  NCCL_IB_HCA: "${NCCL_IB_HCA:-rocep1s0f1,roceP2p1s0f1}"

  NCCL_SOCKET_IFNAME: "${NCCL_SOCKET_IFNAME:-enP7s7,enp1s0f1np1}"

  NCCL_IB_GID_INDEX: "${NCCL_IB_GID_INDEX:-0}"

  NCCL_CROSS_NIC: "1"

  NCCL_CUMEM_ENABLE: "0"

  NCCL_IGNORE_CPU_AFFINITY: "1"

  NCCL_IB_SUBNET_AWARE_ROUTING: "1"

  NCCL_DEBUG: WARN

  HF_TOKEN: <hf_token>




# The vLLM serve command template

command: |

  vllm serve melcheikh/gemma-4-31B-it-qat-NVFP4-Blackwell  \

    --max-model-len {max_model_len} \

    --hf-overrides.text_config '{{"use_bidirectional_attention":null}}' \

    --speculative-config '{{"method": "dflash", "model": "z-lab/gemma-4-31B-it-DFlash", "num_speculative_tokens": 5, "attention_backend": "flash_attn"}}' \

    --quantization modelopt \

    --enable-flashinfer-autotune \

    --enable-prefix-caching \

    --gpu-memory-utilization {gpu_memory_utilization} \

    --override-generation-config '{{"temperature": 1, "top_p": 0.95, "top_k": 64}}' \

    --chat-template fixed_chat_template.jinja \

    --default-chat-template-kwargs '{{"preserve_thinking": true}}' \

    --port {port} \

    --host {host} \

    --kv-cache-dtype bfloat16 \

    --enable-chunked-prefill \

    --max-num-seqs 4 \

    --trust-remote-code \

    --enable-auto-tool-choice \

    --tool-call-parser gemma4 \

    --reasoning-parser gemma4 \

    --load-format instanttensor  \

    --language-model-only \

    --max-num-batched-tokens {max_num_batched_tokens} \

    -tp {tensor_parallel} \

    --distributed-executor-backend mp

DannyTup · June 9, 2026, 5:30pm

I’ve run a bunch of the Gemma4 variants through some benchmarks with InspectAI and posted the results here:

It includes the total run time for the tests. I don’t think measuring t/s is a great comparison because some sometimes the faster t/s models aren’t as smart and will waste tokens. For example Gemma4 26B-A4B used 500k input tokens compared to Gemma4 31B using 35k.

qat-w4a16-ct is unsurprisingly fastest, but also (unsurprisingly) scored worst (although not by a large margin in these benchmarks). I haven’t tried to use them for anything useful yet, so I don’t know how that trade-off feels in actual use.

keith103 · June 9, 2026, 9:35pm

the spark is a piece painted gold ■■■■

my macbook pro m4 max leaves it so far in the dust it is unreal, literally

arctic.gus · June 9, 2026, 9:42pm

It does? What speeds do you get on it?

Liquidlava1990 · June 9, 2026, 11:45pm

I agree I dumped them and bought a gaudi 2 server lol

0rand · June 10, 2026, 11:05am

Qwen 3.5 122b - nothing else get close on 1 spark, and its is in top 2-3 on dual

0rand · June 10, 2026, 11:08am

Two sparks cost 7-8K. There are no MacBooks with 256G ram. Last time I checked MacBook M5 Max 16 with 128GB and 4TB NVmE it was around 6-6.5K USD. MacStudio with 256 will be more than 8K AFAIK.
If you don’t develop or care for CUDA and just need inference - either Mac or Spark will never pay off. Got API.

If you need to max out speed and RAM on consumer level - MacStudio M3 Ultra 512GB. 20 000 usd or something? Go big or go home I guess

But let’s calculate: 512GB = 4x Sparks at 4K + 4 cables at 100 + router at 1500 = ~18K USD
And 4 sparks in TP=4 will annihilate speed of a single M3 Ultra GPU. Take your poison.

PS and for the argument that dense model does not scale with tensor parallelism - not true. For the science I did run Qwen 3.6 27b on 2 spark cluster with vllm-distributed. Token generation went up to 30 tok/s from 21-22 tok/s. 40% up. So it works. But for MoE it works better - from 60 to 70% speed up. Almost nobody makes dense models anymore - inference too expensive. As above someone correctly said - the model is made for a very specific niche of gaming GPU with very fast cores and memory but very constrained on GB, e.g. 5090. Running it on GPU with vast GB but slower throughput is like fitting a square nut in a round hole and crymeariver afterwards.

Topic		Replies	Views
Gemma 4 Day-1 Inference on NVIDIA DGX Spark — Preliminary Benchmarks DGX Spark / GB10 llama , agentic-ai	17	8477	April 7, 2026
Google Gemma 4 - It will work on DGX Spark? DGX Spark / GB10 agentic-ai	22	2615	April 5, 2026
Gemma 4 31B on DGX Spark: Runtime FP8 Benchmarks — Single & Dual Node (TP=2) DGX Spark / GB10 llama , agentic-ai	0	2490	April 7, 2026
DGX Spark performance DGX Spark / GB10	49	5848	February 13, 2026
[Guide] Uncensored Gemma-4-26B at 45 tok/s on DGX Spark — Actually Feels Great to Use! DGX Spark / GB10 Projects openclaw	9	3982	April 20, 2026
Gemma 4 Models - which vLLM version? Any PRs spotted? DGX Spark / GB10 nim , llama	177	11601	April 16, 2026
Does anyone have Gemma 4 31B running on Spark DGX? DGX Spark / GB10	8	2885	April 9, 2026
Someone post this: Gemma 4 26B-A4B MoE running at 45-60 tok/s on DGX Spark DGX Spark / GB10	4	2741	April 5, 2026
How to run GLM 4.7 on dual DGX Sparks with vLLM / mods support in spark-vllm-docker DGX Spark / GB10	27	4286	January 2, 2026
Gemma4 draft models are now available DGX Spark / GB10 Projects	8	3001	May 20, 2026

Slow inference with 31b model Gemma 4? Optimizations?

Related topics