Gemma 4 on DGX Spark (GB10): System Freeze at >80% Utilization & sm_121 Kernel Issues

Environment: CUDA 13.0, Driver 580.126.xx, MAX Engine / vLLM (aarch64)

Model: google/gemma-4-31B-it (BF16 & FP8)

I am experiencing a critical system-level freeze (hard-lock requiring power cycle) on the DGX Spark when running Gemma 4 at high memory utilization. While the 128GB unified pool should hold the 31B model, the system becomes unresponsive as soon as the KV Cache attempts to pre-allocate beyond the ~80GB HBM boundary.

Specific Observations:

  1. Setting gpu_memory_utilization > 0.8 consistently triggers a hang. This appears related to the C2C interconnect not handling “paging” correctly under pressure during the prefill phase.

  2. Using the MAX engine for the MoE variant (26B-A4B) fails with ValueError: Failed to resolve module path for MOGGKernelAPI. It seems sm_121 binaries for MoE routing are missing from the current graph compiler.

  3. Official aarch64 images (e.g., vllm-openai:latest) lack sm_121 targets in their prebuilt binaries and fall back to sm_120, which causes decoding errors or crashes.
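For anyone hitting the same wall on point 1, this is the stop-gap launch I'm using in the meantime. It's a sketch, not a fix: the flags are standard vLLM engine arguments, and the idea is simply to cap utilization below the ~0.8 threshold that triggers the hang and shrink the context window so the KV cache pre-allocation stays in bounds.

```shell
# Stop-gap: keep gpu-memory-utilization under the ~0.8 threshold that
# triggers the freeze, and reduce max context so the KV cache
# pre-allocation fits. Flags are standard vLLM; adjust to your workload.
vllm serve google/gemma-4-31B-it \
  --gpu-memory-utilization 0.75 \
  --max-model-len 8192 \
  --dtype bfloat16
```

This obviously leaves a chunk of the unified pool on the table, which is the whole problem; it just keeps the box from hard-locking until a real patch lands.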

What I have tried so far:

  • Updated to latest Spark OTA (February update).

  • Forced TRITON_ATTN for heterogeneous head dimensions.

  • Attempted the avarok/vllm-dgx-spark community image, which provides better stability but still hits the memory ceiling.
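For concreteness, the attention-backend override from the list above looks like this. I'm assuming vLLM's `VLLM_ATTENTION_BACKEND` environment variable here; the exact backend name can vary between vLLM versions, so check your build's docs for the valid values.

```shell
# Force the Triton attention backend before launching vLLM.
# "TRITON_ATTN" is what worked in my environment; newer vLLM builds
# may spell the backend name differently.
export VLLM_ATTENTION_BACKEND=TRITON_ATTN
```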

I want to be able to load Gemma 4 in MAX. I have also raised an issue on Modular's GitHub. Is there anything I should attempt on the machine to resolve this, or will a patch be required from either NVIDIA or the MAX team?

Thanks

I’ve run the MoE model using eugr’s recipe for hundreds of thousands of inferences at 80%+ utilization for hours. Compared to qwen3 35b, it was much more stable and responsive.

Apologies. I’m still new to this world. Can you point me to what you mean by eugr recipe? Thanks

No problem. Here’s the recipe and the repo has all the instructions for setup. There’s also the sparkrun project that is similar in spirit but supports more than just vllm.


As a follow-up to @Zambonilli,

If you’re just getting started, sparkrun is the easiest entry point.

# Install uv (you probably want that anyway...)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install sparkrun and set up your cluster in one step.
# This launches a setup wizard that walks you through the
# necessary configuration for your Spark.
uvx sparkrun setup

And then you can run predefined recipes or make your own. Here is an existing recipe for gemma 4 MoE.

sparkrun run @experimental/gemma4-26b-a4b-online-fp8-vllm

You can view the recipe file from the same link that @Zambonilli sent: https://github.com/spark-arena/recipe-registry/tree/main/experimental-recipes/gemma4


sparkrun is meant to help you get set up whether you have one Spark or a cluster, so you can spend more time on inference and what you’re actually trying to accomplish, and less time on troubleshooting.

And if you get into it, you can then run benchmarks and publish your benchmarks to Spark Arena. You can see what other people are running and what kind of performance they’re getting.

The Spark Arena Team (@eugr, @raphael.amorim, and myself) are also working to publish more official recipes so you can run more models, more easily!

More docs for sparkrun at: https://sparkrun.dev

Or just ask on the forums.
