Thank you for sharing your journey! Would you mind sharing some details on your NVFP4 conversion? I thought the claim was that we don’t need a calibration dataset.
Sorry to rain on your parade, but it has been possible to use NVFP4 models on Spark for a while now. I haven’t seen any performance numbers in your article other than 24 t/s prefill, which is probably a number that vLLM prints in the logs for a very short prompt, so it doesn’t mean anything, as it’s super slow.
Also, seeing these lines is not proof (these are from my log for another NVFP4 model, but they are similar to the ones you’ve posted):
(EngineCore_DP0 pid=2168) INFO 12-11 06:06:24 [gpu_model_runner.py:3551] Starting to load model RedHatAI/Qwen3-30B-A3B-NVFP4...
(EngineCore_DP0 pid=2168) INFO 12-11 06:06:24 [compressed_tensors_w4a4_nvfp4.py:63] Using flashinfer-cutlass for NVFP4 GEMM
(EngineCore_DP0 pid=2168) INFO 12-11 06:06:25 [cuda.py:412] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(EngineCore_DP0 pid=2168) INFO 12-11 06:06:25 [layer.py:379] Enabled separate cuda stream for MoE shared_experts
(EngineCore_DP0 pid=2168) INFO 12-11 06:06:25 [compressed_tensors_moe.py:253] Using Cutlass for CompressedTensorsW4A4Nvfp4MoEMethod.
You need to also look for this one:
(EngineCore_DP0 pid=2168) WARNING 12-11 06:06:25 [compressed_tensors.py:742] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
If you are not getting it, congrats, something is different in your setup, but you still need to run benchmarks (e.g. vllm bench serve with 1 request and with 10) and compare against another 4-bit quant (AWQ).
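A quick way to run the log check described above is to scan a saved vLLM log for the fallback warning alongside the NVFP4 lines. This is only a minimal sketch: the marker strings are copied from the log lines quoted in this thread, the function name is made up for illustration, and other vLLM versions may word these messages differently.

```python
# Substrings copied from the log lines quoted in this thread.
# Assumption: other vLLM builds may phrase these messages differently.
NVFP4_MARKERS = (
    "Using flashinfer-cutlass for NVFP4 GEMM",
    "Using Cutlass for CompressedTensorsW4A4Nvfp4MoEMethod",
)
FALLBACK_MARKER = "Falling back to UnquantizedLinearMethod"


def summarize_vllm_log(log_text: str) -> dict:
    """Report which NVFP4 markers appear and how often the fallback warning fires.

    Hypothetical helper, not part of vLLM.
    """
    lines = log_text.splitlines()
    found = {marker: any(marker in line for line in lines) for marker in NVFP4_MARKERS}
    fallback_count = sum(FALLBACK_MARKER in line for line in lines)
    return {"nvfp4_markers": found, "fallback_warnings": fallback_count}


if __name__ == "__main__":
    import sys

    # Usage: python check_log.py < vllm.log
    print(summarize_vllm_log(sys.stdin.read()))
```

Even a clean result here says nothing about speed, so the benchmark comparison is still needed.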
Spark comes with CUDA 13, so why does it say CUDA 12.1 in one place and CUDA 13.0 in another?
PR #29242 was merged into main 2 weeks ago.
The vLLM main branch is still at version 0.11.2, which is easy to verify by checking setup.py in the repository. E.g., today’s build of mine is 0.11.2.dev699+g804e3468c.d20251209.cu130
There are a lot of other inconsistencies in the article, which, frankly, looks AI-generated.
Thank you for this helpful info, I’ll explore and potentially update the image and article to make it better!
I can check here and see that this model has 109B params. You might be able to get it to work on a single Spark, with performance issues, if you enable plenty of swap, but I can’t do the math off the top of my head without some more numbers.
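For a rough version of that math, the weights-only footprint is just parameter count times bits per weight. The sketch below assumes 128 GB of unified memory on a single Spark and ignores the KV cache, activations, quantization scales, and whatever the OS needs, so real usage will be higher.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 109e9          # parameter count mentioned in the post
SPARK_MEMORY_GB = 128   # assumption: unified memory of a single DGX Spark

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    size_gb = weight_footprint_gb(PARAMS, bits)
    verdict = "fits" if size_gb < SPARK_MEMORY_GB else "does not fit"
    print(f"{label}: ~{size_gb:.1f} GB of weights, {verdict} in {SPARK_MEMORY_GB} GB")
```

That works out to about 218 GB at 16-bit, 109 GB at 8-bit, and 54.5 GB at 4-bit, so only the 4-bit case leaves comfortable headroom under these assumptions.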
Thankfully, I didn’t actually need to quantize the model, as somebody else did. The other quant failed to produce coherent output (mentioned in the article), but this one worked well: RESMP-DEV/Qwen3-Next-80B-A3B-Instruct-NVFP4
Thanks for taking a look at everything. As mentioned in the article, my intention was to finally get Qwen3-Next MoEs at an NVFP4 quant working on the Spark (I could earlier get INT4, 8-bit, and 16-bit quants working, but never the optimized NVFP4 quant). As the article states at the very beginning: “This breakthrough represents the convergence of three cutting-edge technologies”, followed by the intersection of NVFP4 quantization, the Blackwell GB10 architecture, and Qwen3-Next MoE specifically. So, you’re okay; you’re not raining on my parade, but maybe somebody else’s parade.
I could write a new article with a clearly stated intention to do performance benchmarks. I’m sure I could make it faster if I wanted to. My intention was to get something that wasn’t working, well, working. The article shows many iterations of trying something, failing, and trying again until it worked. Though, if you’re implying that it is stated somewhere that an article like mine must include certain performance numbers, I’d be interested in seeing it. As far as I can tell, approaching another’s work by applying an unrequired metric, and then making it appear as if the “parade is being rained on”, is like judging a fish by how well it climbs a tree. Of course it will always fail! With your approach I could apply any metric I want to anything and make it seem like I’m raining on any parade, but I’ve found this is not a helpful way to interact with other people. Please do let me know if you’ve had success with such an approach, as I have not in my limited time so far on this planet.
I’ve had trouble getting any NVFP4 quant models to run on vLLM or TRT-LLM, whether with the recipes on build.nvidia.com, with various other Docker images, or even by trying to build them myself. I have run into conflicts between CUDA versions, and things refusing to support sm121a. My prior failures probably (likely?) reflect ignorance on my part, but the OP’s image does run them.
I have not done any formal benchmarking, but I really don’t need to. The few models I tried (LLama-4-scout, Qwen3-80-a3b and a few others) run at a speed I would expect from unquantized models. They run, but I wouldn’t select these over other quants.
I noted the weird nvidia-smi output and don’t know what to make of that either.
Unrelated – got GLM 4.6V/awq running last night and really impressed.
Nice. How is the speed? How did you run it? I am actually working on quantizing GLM 4.6v to NVFP4 right now, quantizing both the weights and the activations (thus, GLM-4.6v-NVFP4-W4A4).
In most cases, running unquantized models won’t give you any advantage over FP8/AWQ-8bit/GGUF Q8_0 quants, and even 4-bit quants are usually adequate for most tasks.
My point is that there are currently no benefits from using NVFP4 on Spark compared to AWQ quants. AWQ is substantially faster, and I’m not sure about accuracy either, since AWQ keeps activations at 16-bit, whereas NVFP4 converts everything to FP4.
The only advantage NVFP4 would bring is increasing prompt processing speed and total throughput on Blackwell hardware (not even the token generation for a single request). However, I’m not seeing that yet - I’ve posted benchmarks here.
So, by using NVFP4, you are just losing big on performance on Spark, at least currently, compared to 4-bit AWQ quant. 83 t/s vs 64 t/s is a big difference.
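To put that gap in relative terms, here is a tiny sketch. It assumes 83 t/s is the AWQ figure and 64 t/s the NVFP4 one, which is how the surrounding posts read, though the sentence above does not label them explicitly.

```python
def speed_gap(fast_tps: float, slow_tps: float) -> tuple[float, float]:
    """Return (how much faster the fast run is, how much slower the slow run is)."""
    speedup = fast_tps / slow_tps - 1.0
    slowdown = 1.0 - slow_tps / fast_tps
    return speedup, slowdown


# Assumption: 83 t/s is AWQ and 64 t/s is NVFP4.
speedup, slowdown = speed_gap(83.0, 64.0)
print(f"faster quant: +{speedup:.1%}, slower quant: -{slowdown:.1%}")
```

Under that assumption the faster quant is ahead by roughly 30%, or equivalently the slower one trails by about 23%.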
AWQ, based on your experiments, runs quicker than NVFP4. I did not know this, as I’ve just dived into this space (I’m coming from Rust/networking/cryptography). Could it be that NVFP4 is just under-optimized right now, and once it is optimized, it will be faster than other quant methods (like AWQ) at the same bitness on compatible hardware (in this case, our DGX Sparks)?
If you had just posted something like “look - I managed to get NVFP4 working on my Spark”, I’d probably have just given you some extra pointers and moved on. Sorry if I sounded too harsh, but when you make bold claims like this, be prepared for someone to challenge them.
Also, some minor editing and fact-checking would improve the article’s quality significantly.
I’m referring to stuff like nvidia-smi output that doesn’t make any sense, or the claim that GB10 is a CUDA 12.1 device.
I am (obviously) a big fan of AI, and I use it a lot, but we are all doing everyone a disservice when we don’t check the output. It gets ingested into search engines and training sets for future LLMs and degrades the quality for all of us.
Theoretically, it should run faster, at least for batched requests and prompt processing, because it would utilize the native FP4 acceleration in Spark. You can see a similar effect in my FP8 results in the same post, which are slightly better than the AWQ 8-bit quant.
FP4, again, in theory, should give an even bigger boost on GB10, but we’ll see. Right now the implementation is a bit buggy, but the Nvidia folks here said they are working on it.
But that’s just for total throughput and prompt processing (prefill). Token generation will likely be in the same ballpark as AWQ, as it’s mostly memory bandwidth bound, not compute.
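The memory-bandwidth argument can be sketched as a back-of-the-envelope ceiling: each generated token has to read the active weights once, so tokens per second cannot exceed bandwidth divided by bytes read per token. The numbers below are assumptions for illustration only: about 3B active parameters (the "A3B" in the model names) and 273 GB/s of memory bandwidth for Spark. KV cache reads, quantization scales, and any layers left unquantized are ignored, so real numbers will be lower.

```python
def decode_tps_upper_bound(
    active_params: float, bits_per_weight: float, bandwidth_gb_s: float
) -> float:
    """Bandwidth-only ceiling on single-request decode speed, in tokens/s."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


ACTIVE_PARAMS = 3e9     # assumption: ~3B active params per token for an "A3B" MoE
BANDWIDTH_GB_S = 273    # assumption: DGX Spark memory bandwidth

for label, bits in [("4-bit (AWQ or NVFP4)", 4), ("8-bit", 8), ("16-bit", 16)]:
    bound = decode_tps_upper_bound(ACTIVE_PARAMS, bits, BANDWIDTH_GB_S)
    print(f"{label}: at most ~{bound:.0f} t/s")
```

The ceiling depends only on bits per weight, not on the quant format, which is why 4-bit AWQ and NVFP4 would be expected to land in the same ballpark for single-request token generation once both are well optimized.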