[Guide] Uncensored Gemma-4-26B at 45 tok/s on DGX Spark — Actually Feels Great to Use!

Hey DGX Spark community! 👋

I’ve been experimenting with LLM inference on my DGX Spark and found a setup that not only gets 45+ tokens/second but actually feels great to use day-to-day.

GitHub Repo: https://github.com/ZengboJamesWang/dgx-spark-vllm-gemma4-26b-uncensored

🚀 What Makes This Special: UNCENSORED + FAST

UNCENSORED — No Filtered Responses

This is the AEON-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4 model. It’s completely uncensored — no alignment filtering, no refusals, no “I cannot help with that” walls. It responds directly and honestly without the typical guardrails. This is genuinely refreshing if you’re tired of models that over-refuse or give sanitized answers.

BLAZING FAST with OpenClaw

When paired with OpenClaw, this setup feels incredibly responsive:

  • Responses stream in smoothly without lag
  • Long outputs finish quickly
  • The typing experience is fluid and satisfying

It doesn’t feel like you’re waiting for a model — it feels like a tool that keeps up with you. Very good feeling overall!

Performance Comparison

Tested on DGX Spark with max_tokens=200, warmup excluded:

Setup             | Model                              | Speed       | Memory
This Setup        | Gemma-4-26B Uncensored NVFP4 (MoE) | 45.26 tok/s | ~16.3 GB
vLLM LilaRest 31B | Gemma-4-31B NVFP4 (Dense)          | 9.16 tok/s  | ~18.5 GB
Ollama            | gemma4:31b (Dense)                 | 8.05 tok/s  | ~19 GB
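
For anyone who wants to sanity-check numbers like these on their own box, here is a rough sketch of a single-request measurement against vLLM's OpenAI-compatible endpoint. It is not the repo's scripts/benchmark.sh, which may measure differently. The port, the served model name, the prompt, and the helper names `tok_per_sec` and `bench_once` are all assumptions, and it needs curl, jq, and GNU date.

```shell
#!/usr/bin/env bash
# Assumed endpoint and served model name; adjust to match your server.
BASE_URL="${BASE_URL:-http://localhost:8000}"
MODEL="${MODEL:-AEON-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4}"

# Hypothetical helper: tokens per second from a token count ($1) and seconds ($2).
tok_per_sec() {
  awk -v t="$1" -v s="$2" 'BEGIN { printf "%.2f\n", t / s }'
}

# Send one completion request with max_tokens=200 and print tok/s.
# Uses the server-reported completion token count, so early stops are handled.
bench_once() {
  local start end body tokens
  start=$(date +%s.%N)
  body=$(curl -s "$BASE_URL/v1/completions" \
    -H 'Content-Type: application/json' \
    -d "{\"model\": \"$MODEL\", \"prompt\": \"Explain mixture-of-experts briefly.\", \"max_tokens\": 200}")
  end=$(date +%s.%N)
  tokens=$(printf '%s' "$body" | jq -r '.usage.completion_tokens')
  tok_per_sec "$tokens" "$(awk -v a="$start" -v b="$end" 'BEGIN { print b - a }')"
}
```

Call `bench_once >/dev/null` once as a warmup and then `bench_once` again for the reading. The timing is wall-clock for the whole request, so it includes prompt processing and will come out slightly below the pure decode rate.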

Quick Start

git clone https://github.com/ZengboJamesWang/dgx-spark-vllm-gemma4-26b-uncensored.git
cd dgx-spark-vllm-gemma4-26b-uncensored
bash scripts/start.sh
bash scripts/benchmark.sh

Happy (uncensored) inferencing! 🚀

Does not even run as configured; do not waste your time.

Please let me know what the error is; it works well on my DGX.

I can confirm this. I use this as a “smaller” agentic model with vision capabilities. Works like a charm at 48.78 t/s. I am using it with eugr’s community spark-vllm more or less out of the box. Just take the recipe and change the model name (and adjust the VRAM allocation, e.g. 0.3 with a 132K context size).

I didn’t use the script, but running the manual docker run command with the vllm-openai cu130-nightly image results in the error “Unable to locate package nvidia-cuda-runtime-cu12”.

I switched to the AEON-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4 · Hugging Face recipe (which I think is where ZengboJamesWang got his reference from) and it works!

Hi all, I have updated the repo to use the recipe from AEON-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4 · Hugging Face, but with modified parameters: --gpu-memory-utilization 0.60 and --max-model-len 262000. These differ from the HuggingFace model card defaults (0.85 and 65536). After extensive testing on DGX Spark, we found that 0.60 with the full 262000 context gives the best balance: it uses ~47GB of GPU memory (vs ~100GB with the card defaults) while keeping the full context window and the same 45+ tok/s performance. So try the new settings.
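
As a rough sketch, the two flags mentioned above would be passed to `vllm serve` something like this. Only those two flags come from this thread; the model card's recipe and the repo's scripts/start.sh may pass additional options that are not reproduced here, and the `VLLM_ARGS` array name is just for illustration.

```shell
#!/usr/bin/env bash
MODEL="AEON-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4"

# Only the two flags discussed in this thread; the full recipe may need more.
VLLM_ARGS=(
  serve "$MODEL"
  --gpu-memory-utilization 0.60
  --max-model-len 262000
)

# Print the command for inspection; run it with: vllm "${VLLM_ARGS[@]}"
echo "vllm ${VLLM_ARGS[*]}"
```

Keeping the arguments in an array makes it easy to swap in the card defaults (0.85 and 65536) and compare memory use between the two configurations.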

For the record this model scales rather well with concurrency.

I’ve clocked sustained aggregate >600 tok/s at c=128 over 15+ hour runs for prefix heavy document extraction tasks.

The inherent brevity of Gemma4 in instruct mode, combined with a well-crafted prompt and schema-enforced output, makes this an extremely compelling combination on Spark today, and it will improve further with better NVFP4 support (it’s currently Marlin under the hood).
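
A minimal sketch of how aggregate throughput under concurrency could be measured, for anyone who wants to try it. This is a short burst of identical-size requests, not the sustained prefix-heavy extraction workload described above, so it will not reproduce those figures. The endpoint, model name, prompt, and the helper names `aggregate_tps` and `run_concurrent` are assumptions; it needs curl, jq, xargs, and GNU date.

```shell
#!/usr/bin/env bash
# Assumed endpoint and model name; exported so the xargs child shells see them.
export BASE_URL="${BASE_URL:-http://localhost:8000}"
export MODEL="${MODEL:-AEON-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4}"

# Hypothetical helper: sum token counts from stdin, divide by wall-clock seconds ($1).
aggregate_tps() {
  awk -v s="$1" '{ total += $1 } END { printf "%.2f\n", total / s }'
}

# Fire $1 requests at once and report aggregate generated tokens per second.
run_concurrent() {
  local c="$1" start end counts
  counts=$(mktemp)
  start=$(date +%s.%N)
  seq "$c" | xargs -P "$c" -I{} sh -c '
    curl -s "$BASE_URL/v1/completions" \
      -H "Content-Type: application/json" \
      -d "{\"model\": \"$MODEL\", \"prompt\": \"Request {}: describe MoE routing.\", \"max_tokens\": 200}" |
      jq -r ".usage.completion_tokens"' > "$counts"
  end=$(date +%s.%N)
  aggregate_tps "$(awk -v a="$start" -v b="$end" 'BEGIN { print b - a }')" < "$counts"
  rm -f "$counts"
}
```

For example, `run_concurrent 128` sends 128 requests in parallel. Dividing the aggregate figure by the concurrency gives the per-stream rate, which is expected to be far below the single-request speed.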

The speed comparison does not make sense, because the 31B is a dense model. This should be compared to the basic 27B version; I guess the results would be very similar.

I’m not sure why it doesn’t make sense to compare the 31B dense with 26B MoE at the same precision? The 26B outperforms the 31B dense in almost everything as far as I can see and if it delivers those results at 5 times the token generation speed, in less VRAM, I’ll take it!

It is a completely different architecture. In a dense model all parameters (the whole model) are activated, but the 26B activates only about 4B params, roughly 8 times fewer. It is equal to a 4B dense model with the specifics of this particular area. Basically, if you work in some very specific area, a 4B model will deliver better results than the 26B. The comparison that I say does not make sense is the performance in tokens. If a model is not able to reach some particular level, it does not matter how fast it fails to reach it. With models, a 5% difference is the difference between looping over “wait a second…” and finding a proper solution on the first try.
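
For reference, the two ratios being argued over in this exchange can be checked with quick arithmetic. This only uses numbers quoted in the thread (the ~4B active-parameter figure is read from the "A4B" in the model name); the helper name `ratio` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print $1 / $2 rounded to two decimals.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 45.26 9.16   # generation speed: 26B MoE vs 31B dense, both on vLLM
ratio 31 4         # params used per token: 31B dense vs ~4B active in the MoE
```

This prints 4.94 and 7.75, which matches the "5 times" and "about 8 times" figures used on each side of the argument. It says nothing about output quality, which is the actual point of disagreement.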