[Guide] Uncensored Gemma-4-26B at 45 tok/s on DGX Spark — Actually Feels Great to Use!

Hey DGX Spark community! 👋

I’ve been experimenting with LLM inference on my DGX Spark and found a setup that not only gets 45+ tokens/second but actually feels great to use day-to-day.

GitHub Repo: GitHub - ZengboJamesWang/dgx-spark-vllm-gemma4-26b-uncensored: High-performance uncensored Gemma-4-26B inference on NVIDIA DGX Spark using vLLM - 45+ tok/s · GitHub

🚀 What Makes This Special: UNCENSORED + FAST

UNCENSORED — No Filtered Responses

This is the AEON-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4 model. It’s completely uncensored — no alignment filtering, no refusals, no “I cannot help with that” walls. It responds directly and honestly without the typical guardrails. This is genuinely refreshing if you’re tired of models that over-refuse or give sanitized answers.

BLAZING FAST with OpenClaw

When paired with OpenClaw, this setup feels incredibly responsive:

  • Responses stream in smoothly without lag
  • Long outputs finish quickly
  • The typing experience is fluid and satisfying

It doesn’t feel like you’re waiting for a model — it feels like a tool that keeps up with you. Very good feeling overall!

Performance Comparison

Tested on DGX Spark with max_tokens=200, warmup excluded:

Setup Model Speed Memory
This Setup Gemma-4-26B Uncensored NVFP4 (MoE) 45.26 tok/s ~16.3 GB
vLLM LilaRest 31B Gemma-4-31B NVFP4 (Dense) 9.16 tok/s ~18.5 GB
Ollama gemma4:31b (Dense) 8.05 tok/s ~19 GB

Quick Start

git clone https://github.com/ZengboJamesWang/dgx-spark-vllm-gemma4-26b-uncensored.git
cd dgx-spark-vllm-gemma4-26b-uncensored
bash scripts/start.sh
bash scripts/benchmark.sh

Happy (uncensored) inferencing! 🚀

2 Likes

Does not even run as configured, do not waste you time.

Please let me know what is the error, it works well on my DGX.

I can confirm this. I use this as a “smaller” agentic modell with vision capabilities. Works like a charm with 48.78 t/s. I am using this with eugr’s community spark-vllm more or less out of the box. Just take the recipe and change the model name (and adjust vram consumption e.g. 0.3 with 132K context size).