We unlocked NVFP4 on the DGX Spark: 20% faster than AWQ!

After months of iteration and research, we have finally closed the gap that left NVFP4 unleveraged. The results speak for themselves. We are working with Nvidia to include this in their official community image.

The era of “Use 4-bit AWQ over NVFP4” is now OVER. NVFP4 is the same size, yet runs faster.

Even AWQ models run faster inside our image.

Our image runs Qwen3-Next-A3B-80B-Instruct-NVFP4 at 60-110 tok/s when speculative decoding is added. Details are in the link below as well.
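For readers who want a feel for what that looks like, here is a minimal offline sketch using vLLM's Python API on a recent build. The HF repo id and the n-gram prompt-lookup settings are assumptions for illustration only; the exact checkpoint and speculative-decoding configuration our image uses are described in the blog post.

```python
from vllm import LLM, SamplingParams

# Minimal sketch only. The model id is a placeholder, and n-gram prompt-lookup
# is used as a stand-in speculative decoding method; the image's real
# configuration may differ (see the blog post for details).
llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct-NVFP4",  # hypothetical repo id
    max_model_len=8192,
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 4,
        "prompt_lookup_max": 4,
        "prompt_lookup_min": 1,
    },
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain NVFP4 quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)
```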

Link: https://blog.avarok.net/we-unlocked-nvfp4-on-dgx-spark-and-its-20-faster-than-awq-72b0f3e58b83
GitHub (WARNING: PLEASE WAIT FOR STABILIZATION!): GitHub - Avarok-Cybersecurity/dgx-vllm: A dedicated effort to make an optimized, bleeding-edge vLLM image using Docker to support DGX comprehensively

Special thanks to Azeez Ishaqui, MS Machine Learning, for working on this solution.

27 Likes

Will test it tomorrow. :-D

All benchmark data, scripts, and raw JSON results are available in the project repository.

Which repo? GitHub? HF?

Will those patches also go upstream to the vllm-project?

Here you go! https://github.com/Avarok-Cybersecurity/dgx-vllm

1 Like

Oops. Seems to be private, as I get a 404.

Sorry about that, it’s public now!

2 Likes

Nice job!! Looking forward to running the model myself. I really like the short TTFT of the Qwen3 Next Instruct 80B model versus Nemotron Nano etc.; when I'm working, that fits my workflow better than waiting 30-45 s in thinking mode.

Getting to 60-120 tok/s from my baseline of 42 tok/s using the regular images is very impressive.

This will change the game for NVFP4 on the DGX Spark.

Many thanks,

Mark

Thank you, Mark.

By the way, about your article (assuming NO speculative decoding): it's curious that you're getting 45 tps on 8-bit quants, yet I've noticed lower performance on 4-bit quants in general. Regardless, the measured NVFP4 is now faster than AWQ. Nonetheless, we can run a model that is half the size at the same speed and with nearly the same accuracy, so that's a win unto itself for us enthusiasts. Maybe there's something in your configuration that could be adapted to speed things up further?

1 Like

Wow, awesome work, super appreciated.

2 Likes

Thanks for your efforts - I’ll pull this tonight.

Would love an even more in-depth dive into the exact changes needed. While moving to Marlin is fine, ideally CUTLASS would 'strike back' and retake the lead on native hardware.

It would also be ideal if this could be integrated into @eugr's community vLLM repo for consistent usage with clusters, etc.

For people who will now be rapidly searching for new NVFP4-quantized models: realize that a quant's performance depends on the data used to guide the quantization. For use in your domain, doing your own quant may well outperform a general one. This is different from dynamic FP8, where data to guide the quant is generally unnecessary.

The GB10 should be a great platform for doing this, and more discussion on performing NVFP4 quantization across model architectures would be welcome.
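To make that concrete, here is a rough sketch of a domain-calibrated NVFP4 quantization using llm-compressor. The model id, calibration dataset, and sample counts below are placeholders, and the exact scheme name and save API can vary between releases, so treat this as a starting point rather than a recipe.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Placeholders: pick a model and, ideally, calibration data from your own domain.
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical example model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# NVFP4 preset scheme in recent llm-compressor releases; unlike dynamic FP8,
# the calibration set guides the scales, which is why domain data matters.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    recipe=recipe,
    dataset="open_platypus",        # swap in your own domain dataset
    num_calibration_samples=512,
    max_seq_length=2048,
)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```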

1 Like

Thanks. Since you’re going to test, when running the image, you should see logs like:

(EngineCore_DP0 pid=167) INFO 02-20 20:53:21 [nvfp4.py:169] Using 'MARLIN' NvFp4 MoE backend out of potential backends: ['VLLM_CUTLASS', 'MARLIN'].                                                           
(EngineCore_DP0 pid=167) INFO 02-20 20:53:21 [cuda.py:365] Using AttentionBackendEnum.FLASHINFER backend.

Why do you think that CUTLASS would be better?

As for community integration, that’s being done w/ Nvidia. Other users are, of course, free to take the code and do as they please. I will continue to optimize the DGX Spark for us. I want a mass movement away from the cloud for democratized AI. That can only be done if we can run impressive models with affordable hardware. This DGX is great for tinkering in that regard.

1 Like

Interesting! Also curious whether it will work with this model. Need to think about how to adapt the parameters and use Ray/tensor parallelism correctly on 2x DGX Spark.

I would hope so; that's a very powerful model. At 130B params, 2x DGX Sparks would certainly do the job, even with KV cache and speculative decoding. I do have a cluster, so I can give it a shot when I have time.
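In case it helps when someone tries it, a bare-bones sketch of what the two-node launch could look like, assuming a Ray cluster already spans both Sparks. The model id is a placeholder, and tensor parallel size 2 is just one reasonable split for a ~130B model across two machines.

```python
from vllm import LLM

# Assumes `ray start --head` on one Spark and `ray start --address=<head-ip>:6379`
# on the other, so both GPUs are visible to the same Ray cluster.
llm = LLM(
    model="org/some-130b-nvfp4-model",    # placeholder: use the actual checkpoint
    tensor_parallel_size=2,               # one GPU per Spark
    distributed_executor_backend="ray",   # multi-node execution via Ray
    gpu_memory_utilization=0.90,
)

print(llm.generate(["Hello from a 2x DGX Spark cluster!"])[0].outputs[0].text)
```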

2 Likes

Because, all else being equal, NVIDIA's own software should be the optimal option for running on their hardware. CUTLASS is NVIDIA's first-party abstraction layer on top of CUDA.

This is generally thought to be true across the board. That’s why we run CUDA rather than, say, Vulkan on these things. That’s why the official NGC container for Multi-LLM-NIM prefers TensorRT-LLM if available.

Beating CUTLASS is impressive, but it’s in Nvidia’s best interest to treat that as a challenge.

2 Likes

True. I am going to look into porting the CUTLASS MoE grouped GEMM to SM120/SM121. That may beat 42 tok/s.

Cannot thank you and the wider community enough. Godsend.

2 Likes

Have you run any of the models through a verification process where you compare benchmark scores (HLE, GPQA, MMLU) to ensure they're consistent with official implementations?

The worry I have with all of these community alterations/forks, especially those not being submitted upstream, is that there is an overrepresentation of single-batch raw tokens-per-second benchmarks and not overall model validation or massively batched agentic scenarios.

In my current implementation I'm able to complete a full pass of GPQA Diamond (198 parallel long-context reasoning runs) with gpt-oss-120b in 3 hours using official NIMs. I'd be interested to know where this lands under those conditions, and whether it keeps Nemotron 3 Nano in NVFP4 from exploding like it currently does with FlashInfer.
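For anyone wanting to reproduce that kind of validation against this image, one rough sketch is to point lm-evaluation-harness at the OpenAI-compatible endpoint the container serves. The endpoint URL, served model name, task name, and concurrency below are assumptions and will need adjusting to match your setup.

```python
import lm_eval

# Sketch only: endpoint URL, served model name, task name, and concurrency are
# assumptions; adjust to whatever the container actually exposes.
results = lm_eval.simple_evaluate(
    model="local-chat-completions",
    model_args=(
        "model=openai/gpt-oss-120b,"
        "base_url=http://localhost:8000/v1/chat/completions,"
        "num_concurrent=32"
    ),
    tasks=["gpqa_diamond_zeroshot"],
    apply_chat_template=True,
)
print(results["results"])
```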

2 Likes

If it produces correct outputs (as validated by running the official vLLM tests), it would be much better if you contributed it upstream where it belongs (that means the vLLM and CUTLASS repos).

6 Likes

Thank you, I will confer with the relevant parties. I think somebody should merge any relevant code into your repo for those who use your image. Our collective goal is to leverage the hell out of the DGX Spark so that we can all enjoy our own LLMs locally. Things are only getting more efficient as time goes on.

1 Like

We have used the more generalized Pareto benchmarks (results are in the repo in JSON format). I'm sure we could run the more specialized vLLM tests to complement Pareto.

I'll have a look next week. TBF, I'm not a big fan of sed-based patch scripts instead of regular diffs. Yes, they are a bit more flexible, but that is also dangerous, as you may hit a silent failure mode where everything compiles but works differently. At least when a diff fails to apply, it prompts you to investigate and adjust.
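For what it's worth, one middle-ground sketch of that concern: instead of a blind sed substitution, a patch step can verify that the expected pattern matched the expected number of times and abort otherwise. The target file and pattern in the usage example are purely illustrative, not the actual patches in the repo.

```python
import pathlib
import re
import sys

def patch_file(path: str, pattern: str, replacement: str, expected_hits: int = 1) -> None:
    """Apply a regex substitution, but fail loudly if it didn't match as expected."""
    p = pathlib.Path(path)
    new_text, hits = re.subn(pattern, replacement, p.read_text())
    if hits != expected_hits:
        # A plain `sed -i` would exit 0 here and silently ship an unpatched file.
        sys.exit(f"patch failed: {hits} match(es) of {pattern!r} in {path}, expected {expected_hits}")
    p.write_text(new_text)

# Illustrative usage only; the file path and pattern are hypothetical.
patch_file(
    "vllm/model_executor/layers/quantization/nvfp4.py",
    r'backend_priority = \[.*\]',
    'backend_priority = ["MARLIN", "VLLM_CUTLASS"]',
)
```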

1 Like