Environment: CUDA 13.0, Driver 580.126.xx, MAX Engine / vLLM (aarch64)
Model: google/gemma-4-31B-it (BF16 & FP8)
I am experiencing a critical system-level freeze (hard-lock requiring power cycle) on the DGX Spark when running Gemma 4 at high memory utilization. While the 128GB unified pool should hold the 31B model, the system becomes unresponsive as soon as the KV Cache attempts to pre-allocate beyond the ~80GB HBM boundary.
Specific Observations:
- Setting gpu_memory_utilization > 0.8 consistently triggers a hang. This appears to be related to the C2C interconnect not handling “paging” correctly under pressure during the prefill phase.
- Using the MAX engine for the MoE variant (26B-A4B) fails with ValueError: Failed to resolve module path for MOGGKernelAPI. It seems sm_121 binaries for MoE routing are missing from the current graph compiler.
- Official aarch64 images (e.g., vllm-openai:latest) lack sm_121 targets in their prebuilt binaries and default to sm_120, which causes decoding errors or crashes.
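For context on the first observation, here is a back-of-envelope sketch of how the gpu_memory_utilization knob translates into a KV-cache budget on a 128 GB unified pool. The weight size and the subtraction are simplifications of what vLLM's memory profiler actually does, and the numbers are illustrative, not measured on the Spark:

```shell
# Rough KV-cache budget: utilization fraction of total memory, minus weights.
# 31B params at ~2 bytes/param (BF16) is roughly 62 GiB of weights.
awk -v total=128 -v weights=62 -v util=0.8 \
  'BEGIN { printf "KV-cache budget at util %.1f: %.1f GiB\n", util, total*util - weights }'
# -> KV-cache budget at util 0.8: 40.4 GiB
```

At util 0.8 the total reservation is already ~102 GiB of the 128 GiB pool, so pushing the knob higher leaves very little headroom for the OS and other consumers of unified memory, which may be what trips the hard-lock.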
What I have tried:
- Updated to the latest Spark OTA (February update).
- Forced TRITON_ATTN for heterogeneous head dimensions.
- Attempted the avarok/vllm-dgx-spark community image, which provides better stability but still hits the memory ceiling.
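For reference, the workarounds above combine into something like the following invocation. This is a sketch only: the exact image tag, entrypoint, and flag spellings should be checked against the community repo's README, and VLLM_ATTENTION_BACKEND values vary between vLLM versions:

```shell
# Hypothetical launch combining the community image, the Triton attention
# backend, and a memory utilization below the ~0.8 hang threshold.
docker run --gpus all --ipc=host \
  -e VLLM_ATTENTION_BACKEND=TRITON_ATTN \
  avarok/vllm-dgx-spark \
  vllm serve google/gemma-4-31B-it \
    --gpu-memory-utilization 0.75
```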
I want to be able to load Gemma 4 in MAX. I have also raised an issue on Modular’s GitHub. Is there anything I should attempt on the machine to resolve this, or will a patch be required either from NVIDIA or in MAX?
I’ve run the MoE model using eugr’s recipe for hundreds of thousands of inferences at 80%+ utilization for hours. Compared to qwen3 35b, it was much more stable and responsive.
No problem. Here’s the recipe, and the repo has all the instructions for setup. There’s also the sparkrun project, which is similar in spirit but supports more than just vLLM.
If you’re just getting started, sparkrun is the easiest way to start.
# Install uv (you probably want that anyway...)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install sparkrun and set up your cluster in one step
# This launches a setup wizard that walks you through the necessary configuration for your Spark
uvx sparkrun setup
And then you can run predefined recipes or make your own. Here is an existing recipe for gemma 4 MoE.
sparkrun run @experimental/gemma4-26b-a4b-online-fp8-vllm
sparkrun is meant to help you get set up whether you have one Spark or a cluster, so you can spend more time on inference and what you’re actually trying to accomplish, and less time troubleshooting.
And if you get into it, you can then run benchmarks and publish your benchmarks to Spark Arena. You can see what other people are running and what kind of performance they’re getting.
The Spark Arena Team (@eugr, @raphael.amorim and myself) are also working to publish more official recipes to help you be able to run more models, more easily!