I’ve been experimenting with LLM inference on my DGX Spark and found a setup that not only gets 45+ tokens/second but actually feels great to use day-to-day.
This is the AEON-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4 model. It’s completely uncensored — no alignment filtering, no refusals, no “I cannot help with that” walls. It responds directly and honestly without the typical guardrails. This is genuinely refreshing if you’re tired of models that over-refuse or give sanitized answers.
BLAZING FAST with OpenClaw
When paired with OpenClaw, this setup feels incredibly responsive:
Responses stream in smoothly without lag
Long outputs finish quickly
The typing experience is fluid and satisfying
It doesn’t feel like you’re waiting for a model — it feels like a tool that keeps up with you. Very good feeling overall!
Performance Comparison
Tested on DGX Spark with max_tokens=200, warmup excluded:
| Setup | Model | Speed | Memory |
|---|---|---|---|
| This Setup ✅ | Gemma-4-26B Uncensored NVFP4 (MoE) | 45.26 tok/s | ~16.3 GB |
| vLLM LilaRest 31B | Gemma-4-31B NVFP4 (Dense) | 9.16 tok/s | ~18.5 GB |
| Ollama | gemma4:31b (Dense) | 8.05 tok/s | ~19 GB |
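To put the comparison in relative terms, here is a small sketch that turns the measured speeds into speedup factors. The numbers are copied from the results above; the helper name `speedup` is only illustrative.

```python
# Decode speeds in tok/s, copied from the comparison above.
speeds = {
    "This Setup": 45.26,
    "vLLM LilaRest 31B": 9.16,
    "Ollama": 8.05,
}


def speedup(fast: float, slow: float) -> float:
    """Return how many times faster `fast` is than `slow`."""
    return fast / slow


baseline = speeds["This Setup"]
for name, tok_s in speeds.items():
    if name != "This Setup":
        print(f"This Setup vs {name}: {speedup(baseline, tok_s):.2f}x")
```

That works out to roughly 4.9x over the vLLM 31B dense run and roughly 5.6x over Ollama, for single-request generation with max_tokens=200.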
Quick Start
git clone https://github.com/ZengboJamesWang/dgx-spark-vllm-gemma4-26b-uncensored.git
cd dgx-spark-vllm-gemma4-26b-uncensored
bash scripts/start.sh
bash scripts/benchmark.sh
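If you would rather measure tok/s yourself than rely on the script, something along these lines should work. This is a rough sketch and not the repo's benchmark.sh: it assumes the server exposes vLLM's OpenAI-compatible completions endpoint on localhost:8000 and serves the model under its Hugging Face name, so adjust both if your setup differs.

```python
import json
import time
import urllib.request

# Assumptions, not taken from the repo: endpoint URL, port and served model name.
BASE_URL = "http://localhost:8000/v1/completions"
MODEL = "AEON-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4"


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def timed_completion(prompt: str, max_tokens: int = 200) -> float:
    """Send one non-streaming request and return its tok/s."""
    body = json.dumps(
        {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    ).encode()
    req = urllib.request.Request(
        BASE_URL, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    elapsed = time.perf_counter() - start
    return tokens_per_second(payload["usage"]["completion_tokens"], elapsed)


def benchmark(prompt: str, runs: int = 3) -> float:
    """Average tok/s over `runs` requests, with one warmup request excluded."""
    timed_completion(prompt)  # warmup, not counted
    results = [timed_completion(prompt) for _ in range(runs)]
    return sum(results) / len(results)
```

With the server running, `benchmark("Explain mixture-of-experts models.")` returns the average tok/s. The timing includes prompt processing, so it will read slightly lower than a pure decode-speed figure.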
I can confirm this. I use it as a “smaller” agentic model with vision capabilities. It works like a charm at 48.78 t/s. I am using it with eugr’s community spark-vllm more or less out of the box: just take the recipe, change the model name, and adjust the VRAM consumption (e.g. 0.3 with a 132K context size).
I didn’t use the script, but running the manual docker run command results in the error “Unable to locate package nvidia-cuda-runtime-cu12” when using the vllm-openai cu130-nightly image.
Hi all, I have updated the repo to use the recipe from the AEON-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4 model card on Hugging Face, with two modified parameters: --gpu-memory-utilization 0.60 and --max-model-len 262000. These differ from the model card defaults (0.85 and 65536). After extensive testing on DGX Spark, we found that 0.60 with the full 262000 context gives the best balance: it uses ~47GB of GPU memory (vs ~100GB with the card defaults) while keeping the full context window and the same 45+ tok/s performance. Give the new settings a try.
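For anyone scripting their own launch, here is a minimal sketch of how the two changed parameters could be assembled into a command line. It assumes the server is started through the `vllm serve` entry point; the actual recipe in the repo runs in Docker and most likely passes additional flags that are not shown here. `build_serve_command` is a made-up helper name.

```python
import shlex


def build_serve_command(
    model: str, gpu_memory_utilization: float, max_model_len: int
) -> list[str]:
    """Assemble a `vllm serve` argument list with only the two tuned flags."""
    return [
        "vllm", "serve", model,
        "--gpu-memory-utilization", f"{gpu_memory_utilization:.2f}",
        "--max-model-len", str(max_model_len),
    ]


# Values from the post: 0.60 and 262000 (model card defaults: 0.85 and 65536).
cmd = build_serve_command(
    "AEON-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4", 0.60, 262000
)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run` as is, or the printed line pasted into a shell.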
For the record, this model scales rather well with concurrency.
I’ve clocked sustained aggregate throughput of >600 tok/s at c=128 over 15+ hour runs for prefix-heavy document extraction tasks.
The inherent brevity of Gemma 4 in instruct mode, combined with a well-crafted prompt and schema-enforced output, makes this an extremely compelling combination on Spark today, and it will improve further with better NVFP4 support (it’s currently Marlin under the hood).
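To make those concurrency numbers concrete, a quick back-of-the-envelope sketch. It treats 600 tok/s and 15 hours as lower bounds, since the post says ">600" and "15+", and the function names are only illustrative.

```python
def per_stream_rate(aggregate_tok_s: float, concurrency: int) -> float:
    """Average tok/s seen by each concurrent request."""
    return aggregate_tok_s / concurrency


def total_tokens(aggregate_tok_s: float, hours: float) -> float:
    """Tokens generated at a sustained aggregate rate over `hours`."""
    return aggregate_tok_s * hours * 3600


print(per_stream_rate(600, 128))  # 4.6875
print(total_tokens(600, 15))      # 32400000
```

So each of the 128 streams averages at least about 4.7 tok/s, far below the 45+ tok/s single-request speed, but in aggregate a 15 hour run produces more than 32 million tokens.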
I’m not sure why it doesn’t make sense to compare the 31B dense with 26B MoE at the same precision? The 26B outperforms the 31B dense in almost everything as far as I can see and if it delivers those results at 5 times the token generation speed, in less VRAM, I’ll take it!
It is a completely different architecture. In a dense model all parameters (the whole model) are activated, but the 26B has only 4B active parameters, about 8 times fewer than the 31B dense model. It is equivalent to a 4B dense model with the specifics of the particular area. Basically, if you work in a very specific area, a 4B model will deliver better results than the 26B. The comparison that I said does not make sense is the performance in tokens per second. If a model is not able to reach a particular quality level, it does not matter how fast it fails to reach it. With models, a 5% difference is the difference between looping over “wait a second…” and finding a proper solution on the first try.
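The "about 8 times" figure depends on which model you take as the baseline. A quick check, assuming roughly 4B active parameters per token for the MoE (as the "A4B" in the model name suggests) and using the nominal 26B and 31B totals:

```python
def active_ratio(total_params_b: float, active_params_b: float) -> float:
    """How many times more parameters `total` has than the active subset."""
    return total_params_b / active_params_b


print(active_ratio(26, 4))  # 6.5  (the MoE's own total vs its active parameters)
print(active_ratio(31, 4))  # 7.75 (the 31B dense model vs the MoE's active parameters)
```

So the 31B dense model touches roughly 8 times as many parameters per token as the MoE, while the MoE itself activates about 1/6.5 of its own weights.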