Does anyone have Gemma 4 31B running on Spark DGX?

I’m new here and new to getting AI running locally on an NVIDIA Spark. I’m trying to get Gemma-4 31B to run. I’ve tried a couple of versions and keep running into errors. Does anyone have this working correctly and can give me the commands to do it?
I have nvidia/Gemma-4-31B-IT-NVFP4 downloaded. This is a Founders Edition. I’d like to get this running correctly taking full advantage of the hardware. I want to see if this will come even close to the functionality I’m getting doing coding with GPT 5.4 and Gemini 3 Flash and Pro previews.

I’ve been working with AIs trying to get this going for 2 days and getting nowhere fast! When I do get it running, it’s so slow as to be unusable. I’m hoping to get it running correctly and faster. Any help would be GREATLY appreciated.

Scott

I got this thing working to the tune of 7 tokens per second which is about twice what it was doing. Is that all we expect to get out of Gemma-4 on a Spark?

Here’s the command line I’m using:
scott@spark:~$ docker run -it --rm --name gemma4-31b \
  --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -v /home/scott/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:gemma4-cu130 \
  nvidia/Gemma-4-31B-IT-NVFP4 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.95
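
For anyone wanting to reproduce a tokens-per-second number like the one above, here is a minimal Python sketch. It assumes the container started by that command is listening on localhost:8000 and serving vLLM's OpenAI-compatible /v1/completions endpoint. The helper names (build_payload, tokens_per_second, measure) are made up for illustration, and the timing includes prompt processing, so it will read slightly lower than the pure generation speed the server logs report.

```python
import json
import time
import urllib.request

# Assumptions: the server is reachable on localhost:8000 (the -p 8000:8000
# mapping in the docker command) and MODEL matches the name passed to vLLM.
BASE_URL = "http://localhost:8000/v1/completions"
MODEL = "nvidia/Gemma-4-31B-IT-NVFP4"


def build_payload(prompt, max_tokens=256):
    """Build the JSON body for a single non-streaming completion request."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def measure(prompt, max_tokens=256):
    """Send one request and return end-to-end tokens per second."""
    body = json.dumps(build_payload(prompt, max_tokens)).encode("utf-8")
    request = urllib.request.Request(
        BASE_URL, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    elapsed = time.perf_counter() - start
    # The response's "usage" block reports how many tokens were generated.
    return tokens_per_second(result["usage"]["completion_tokens"], elapsed)


# Example (needs the server running):
#   print(f"{measure('Write a Python function that reverses a string.'):.1f} tok/s")
```

Running the same prompt a few times and ignoring the first result gives a steadier number, since the first request may include warm-up work.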

Our vLLM playbook can help with running on Spark: vLLM for Inference | DGX Spark
However, that command looks right. What errors are you seeing?

Does it have to be 31B? 26B-A4B is going to be a lot faster on DGX Spark due to its MoE architecture.
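
A rough back-of-envelope sketch of why the MoE model should decode faster: single-user token generation is usually limited by how fast the weights can be read from memory, and an MoE model only reads its active experts for each token. The numbers below are assumptions, not measurements: about 273 GB/s of memory bandwidth (the figure NVIDIA quotes for DGX Spark), roughly 4-bit weights (0.5 bytes per parameter), and "A4B" taken to mean about 4B active parameters per token. KV cache reads and other overhead are ignored, so these are ceilings, not predictions.

```python
# Assumed values, see the note above; none of these come from a benchmark.
MEMORY_BANDWIDTH_GB_S = 273.0  # quoted DGX Spark memory bandwidth
BYTES_PER_PARAM = 0.5          # ~4-bit quantized weights


def decode_ceiling_tok_s(active_params_billions):
    """Upper bound on tokens/sec if every active weight is read once per token."""
    gb_read_per_token = active_params_billions * BYTES_PER_PARAM
    return MEMORY_BANDWIDTH_GB_S / gb_read_per_token


dense_ceiling = decode_ceiling_tok_s(31.0)  # dense 31B: all weights read per token
moe_ceiling = decode_ceiling_tok_s(4.0)     # 26B-A4B: only ~4B active per token

print(f"dense 31B ceiling: {dense_ceiling:.1f} tok/s")
print(f"26B-A4B ceiling:   {moe_ceiling:.1f} tok/s")
```

Under these assumptions the dense model tops out below 20 tok/s no matter how well it is tuned, while the MoE model has several times more headroom.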

For that model the community vLLM image that @eugr manages has a recipe which works well:

I ran it yesterday and it was pretty awesome (and fast).

@aniculescu, thanks for the reply. My process to get it running was roundabout. I started with an AI telling me how to do it, and that did not work well.

I finally did get to the vLLM playbook and got it working. I am hoping for better than 6.7 tokens per second of throughput. That is not usable. I tried the NIM install, but they don’t support the ARM64 architecture yet. I hope they do soon, if it will help here. I don’t know enough about this to know if that would improve performance or not (the sycophantic AI tells me it will).

I purchased this box a week ago to try to reduce my token spend. I am using AI to make a Python app with an AI embedded in it and trying to get away from spending $2500/month on tokens. If this Spark will do the job, it will pay for itself very rapidly. I don’t know beans about the back-end setup, but I’ve been programming computers since 1981 and have some Linux experience. I’m not sure that the latest Gemma model is capable of full development work like I’m doing with Gemini Pro Preview and GPT 5.4. I doubt it, but I thought it was worth a try to save that kind of money. My source code requires a lot of space. It’s about 1 MB and growing when I load all of it. It is very modularized, so I can often get away with less, but I need an AI that will think like the Pros and handle a lot of tokens.
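
To put the 1 MB figure in terms of context length, here is a rough sketch. It assumes about 4 characters per token, which is a common rule of thumb for English text and code, not the real Gemma tokenizer count, and it uses the 65536 value from the --max-model-len flag in the docker command earlier in the thread. The function names and the 4096-token output reserve are made up for illustration.

```python
# Rule-of-thumb only: the real ratio depends on the tokenizer and the code style.
CHARS_PER_TOKEN = 4.0
MAX_MODEL_LEN = 65536  # --max-model-len from the docker command in this thread


def estimate_tokens(num_bytes):
    """Approximate token count for mostly ASCII source code of the given size."""
    return int(num_bytes / CHARS_PER_TOKEN)


def fits_in_context(num_bytes, reserve_for_output=4096):
    """True if the estimated prompt plus room for the reply fits in the window."""
    return estimate_tokens(num_bytes) + reserve_for_output <= MAX_MODEL_LEN


source_bytes = 1_000_000  # "about 1 MB" of source
print(estimate_tokens(source_bytes))  # about 250000 tokens under this heuristic
print(fits_in_context(source_bytes))  # False: far larger than a 65536 window
```

By this estimate the full code base is several times larger than the configured window, so only a subset of modules could be loaded per request at that setting.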

So to summarize, I’m an end-user, not an AI specialist. I need a very capable AI running locally if I can get one. I have 30 days to return it if it doesn’t work. I’m hoping for a better outcome than that!

@haidij

Thanks for the reply. I’ll try that next.

I need something with serious coding capabilities, very close to the Pro models out there and I’m skeptical. But it’s next on the roster.

Scott

Welcome to the community.

You can also use sparkrun to automate running recipes from @eugr’s spark-vllm-docker. It helps you get set up, apply best practices, and then focus on doing stuff instead of how to make it work.

@eugr, @raphael.amorim, and I are also the maintainers of Spark Arena, where you can see real performance data on different models on the Spark.

To run the 26B MoE model with sparkrun (after setup), you run:

sparkrun run @experimental/gemma4-26b-a4b-awq4-vllm

The @experimental prefix refers to which recipe “registry” we’re pulling from. The experimental registry is published by the Spark Arena team and is essentially recipes that we’re testing and sharing that have not yet been promoted to drop the @experimental part.

We try to be active on the forums and help people. Reach out and let us know if things do or don’t work.

I installed LM Studio on the Spark today and got the Q4_K_M model running at about 10-11 tok/sec