Environment: CUDA 13.0, Driver 580.126.xx, MAX Engine / vLLM (aarch64)
Model: google/gemma-4-31B-it (BF16 & FP8)
I am experiencing a critical system-level freeze (hard-lock requiring power cycle) on the DGX Spark when running Gemma 4 at high memory utilization. While the 128GB unified pool should hold the 31B model, the system becomes unresponsive as soon as the KV Cache attempts to pre-allocate beyond the ~80GB HBM boundary.
Specific Observations:
- Setting gpu_memory_utilization > 0.8 consistently triggers a hang. This appears to be related to the C2C interconnect not handling “paging” correctly under pressure during the prefill phase.
- Using the MAX engine for the MoE variant (26B-A4B) fails with ValueError: Failed to resolve module path for MOGGKernelAPI. It seems sm_121 binaries for MoE routing are missing from the current graph compiler.
- Official aarch64 images (e.g., vllm-openai:latest) lack sm_121 targets in their prebuilt binaries and default to sm_120, which causes decoding errors or crashes.
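For context on the first observation, here is a back-of-envelope sketch of how the gpu_memory_utilization knob translates into a KV-cache budget on a 128 GB unified pool. The weight size and the subtraction are simplifications of what vLLM's memory profiler actually does, and the numbers are illustrative, not measured on the Spark:

```shell
# Rough KV-cache budget: utilization fraction of total memory, minus weights.
# 31B params at ~2 bytes/param (BF16) is roughly 62 GiB of weights.
awk -v total=128 -v weights=62 -v util=0.8 \
  'BEGIN { printf "KV-cache budget at util %.1f: %.1f GiB\n", util, total*util - weights }'
# -> KV-cache budget at util 0.8: 40.4 GiB
```

At util 0.8 the total reservation is already ~102 GiB of the 128 GiB pool, so pushing the knob higher leaves very little headroom for the OS and other consumers of unified memory, which may be what trips the hard-lock.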
What I have tried:
- Updated to the latest Spark OTA (February update).
- Forced TRITON_ATTN for heterogeneous head dimensions.
- Attempted the avarok/vllm-dgx-spark community image, which provides better stability but still hits the memory ceiling.
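For reference, the workarounds above combine into something like the following invocation. This is a sketch only: the exact image tag, entrypoint, and flag spellings should be checked against the community repo's README, and VLLM_ATTENTION_BACKEND values vary between vLLM versions:

```shell
# Hypothetical launch combining the community image, the Triton attention
# backend, and a memory utilization below the ~0.8 hang threshold.
docker run --gpus all --ipc=host \
  -e VLLM_ATTENTION_BACKEND=TRITON_ATTN \
  avarok/vllm-dgx-spark \
  vllm serve google/gemma-4-31B-it \
    --gpu-memory-utilization 0.75
```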
I want to be able to load Gemma 4 in MAX. I have also raised an issue on Modular’s GitHub. Is there anything I should attempt on the machine to resolve this, or will a patch be required either from NVIDIA or in MAX?
I’ve run the MoE model using eugr’s recipe for hundreds of thousands of inferences at 80%+ utilization for hours. Compared to qwen3 35b, it was much more stable and responsive.
No problem. Here’s the recipe, and the repo has all the instructions for setup. There’s also the sparkrun project, which is similar in spirit but supports more than just vLLM.
If you’re just getting started, sparkrun is the easiest way to start.
# Install uv (you probably want that anyway...)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install sparkrun and set up your cluster in one step
# This launches a setup wizard that walks you through the necessary configuration for your Spark
uvx sparkrun setup
And then you can run predefined recipes or make your own. Here is an existing recipe for gemma 4 MoE.
sparkrun run @experimental/gemma4-26b-a4b-online-fp8-vllm
sparkrun is meant to help you get set up whether you have one Spark or a cluster, so you can spend more time on inference and what you’re actually trying to accomplish, and less time troubleshooting.
And if you get into it, you can then run benchmarks and publish your benchmarks to Spark Arena. You can see what other people are running and what kind of performance they’re getting.
The Spark Arena Team (@eugr, @raphael.amorim and myself) are also working to publish more official recipes to help you be able to run more models, more easily!