So it’s looking like this might be the best local model we’re likely to get in 2026 (EDIT: runnable by people who aren’t oil sheiks), except maybe Qwen 3.6 122B.
I’m curious what the max inference speed is that I can get out of it on a Spark. We all know about quantization, so that’s not what I’m asking here. Sticking to a high-quality baseline of FP8 model weights and unquantized KV cache, can anything other than MTP be done to improve throughput? Some experimental CUDA flag? Some vLLM flag?
Below are the numbers I got in my simple benchmark.
Environment
Model: Qwen/Qwen3.6-27B-FP8
vLLM: latest main build from today (via spark-vllm-docker)
At FP8 you are moving 27B parameters × 1 byte per pass = ~27GB. You have ~270GB/s of bandwidth at most. Divide 270GB/s by 27GB and you get about 10 tokens per second max throughput, which is pretty close to your baseline.
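The back-of-the-envelope estimate above can be written out as a small calculation. This is a rough sketch that assumes decoding is purely memory-bandwidth bound and that every weight is read exactly once per generated token; it ignores KV cache reads, activations, and kernel overhead, so it gives a ceiling rather than a prediction. The function name and parameters are illustrative.

```python
def max_decode_tokens_per_s(params_billion: float,
                            bits_per_weight: float,
                            bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed for a dense model."""
    # GB of weights that must be streamed from memory per generated token
    weights_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / weights_gb


# 27B parameters at 8 bits per weight, ~270 GB/s of memory bandwidth
print(max_decode_tokens_per_s(27, 8, 270))
```

Lowering `bits_per_weight` raises this ceiling proportionally, which is why weight precision matters so much on a bandwidth-limited machine.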
The MTP trick is actually giving you a very nice “free” boost. I’m not sure you are going to see much better over most workloads.
To see why everyone is all hot and bothered to get proper NVFP4 support: at roughly 4 bits per weight you would only have to move about 13.5GB per pass, and should get a theoretical 20 or so tokens per second with very minimal loss of quality.
Not all layers are created equal. In PrismaQuant, we detect the sensitive layers and leave them in BF16, and quantize the insensitive layers to NVFP4. Quantization isn’t bad if you know how to use it.
I dare say a model that’s 25% BF16 and 75% NVFP4 is better than one that is FP8 through and through. I’ve done a lot of research to show this. Take a look at the model card I created for the latest version of Qwen3.6 PrismaQuant: rdtand/Qwen3.6-27B-PrismaQuant-5.5bit-vllm on Hugging Face
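As a rough illustration of the mix described above, the average bits per weight of a mixed-precision model is just a weighted sum. This sketch assumes the 25%/75% split is by parameter count and ignores the per-block scale factors that NVFP4 stores alongside the 4-bit values, which add a small overhead; the helper name is made up for illustration.

```python
def average_bits_per_weight(mix):
    """mix: list of (fraction_of_parameters, bits_per_weight) pairs."""
    total_fraction = sum(fraction for fraction, _ in mix)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(fraction * bits for fraction, bits in mix)


# 25% of parameters kept at 16 bits, 75% quantized to 4 bits
print(average_bits_per_weight([(0.25, 16), (0.75, 4)]))
```

Under these assumptions the mix averages 7 bits per weight, slightly below the 8 bits of a pure FP8 model.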
I have a section that compares various quantizations:
| Target bpp | Achieved bpp | Predicted Δloss | NVFP4 / MXFP8 / BF16 | vs 5.5 bpp |
|---|---|---|---|---|
| 4.5 | 4.500 | 948 | 416 / 1 / 0 | +99% Δloss, −18% size |
| 4.75 | 4.750 | 704 | 373 / 12 / 32 | +48% Δloss, −14% size |
| 5.0 | 5.000 | 604 | 347 / 14 / 56 | +27% Δloss, −9% size |
| 5.25 | 5.250 | 532 | 321 / 20 / 76 | +12% Δloss, −5% size |
| 5.5 | 5.500 | 477 | 300 / 30 / 87 | ← this artifact |
| 6.0 | 6.000 | 393 | 270 / 35 / 112 | −18% Δloss, +9% size |
| 7.0 | 7.000 | 276 | 211 / 62 / 144 | −42% Δloss, +27% size |
| 8.25 | 8.249 | 180 | 152 / 73 / 192 | −62% Δloss, +50% size |
You can see that on a 16-bit model, targeting an 8.25 bit quantization can get you great results – nearly that of BF16 at half the cost.
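For anyone wanting to check the "vs 5.5 bpp" column, it appears to follow directly from the other columns. The sketch below recomputes it from the target bpp and predicted Δloss values in the comparison, assuming that size scales linearly with bits per parameter; that scaling, and the function name, are my assumptions rather than something the model card states.

```python
# (target bpp, predicted Δloss) pairs copied from the comparison
ROWS = [
    (4.5, 948), (4.75, 704), (5.0, 604), (5.25, 532),
    (5.5, 477), (6.0, 393), (7.0, 276), (8.25, 180),
]
BASELINE_BPP, BASELINE_LOSS = 5.5, 477  # the 5.5 bpp row


def relative_to_baseline(bpp, loss):
    """Return (Δloss change %, size change %) relative to the 5.5 bpp row."""
    loss_pct = (loss / BASELINE_LOSS - 1) * 100
    size_pct = (bpp / BASELINE_BPP - 1) * 100  # assumes size is proportional to bpp
    return round(loss_pct), round(size_pct)


for bpp, loss in ROWS:
    loss_pct, size_pct = relative_to_baseline(bpp, loss)
    print(f"{bpp:>5} bpp: {loss_pct:+d}% Δloss, {size_pct:+d}% size")
```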
Sorry but I just don’t trust that NVFP4 is something I should be using.
Making NVFP4 quants is non-trivial. Who is making them today?
The majority are made by amateurs on HF (never posting any benchmarks, often not even testing whether their model works). You can forget about these. NVFP4 requires an extra training round, and that means paying for GPUs for a few days of training, having a great dataset to begin with, and knowing what you’re doing.
Nvidia themselves, with benchmarks showing NVFP4 is the same as or better than FP8. And yet their NVFP4 has a worse KLD than AWQ 4-bit (according to a Reddit thread)? Is the extra training turning it into some sort of mini-finetune, so that KLD is no longer a good metric for loss of quality? Idk.
No first-party AI lab is bothering to make NVFP4 quants. Not even American models like Gemma 4. It’s just a few days of extra work and yet the AI experts don’t care about it. I think that says something.
Qwen3.6-27B surprises me: it immediately finds the bugs that 35B gets stuck looping on, and fixes them by itself within a few rounds of debugging. I now run Qwen3.6-27B (Q4_K_M) on Win11 with an RTX 3090 at an acceptable speed, but I know it might become a nightmare to run the 27B on a Spark. Is there any way to get 30+ tok/s on a Spark?
I don’t know how well it translates to real work, but I was very surprised when running AgentBench that the dense model was barely slower than the MoE for a better score, see timings here:
I had assumed the MoE would be the obvious choice, but if this isn’t a fluke and it happens in real work, the dense one is the way to go. Those numbers are all with the same settings (MTP2 but nothing else).
(I do plan to add dflash and some of the other quant formats when I have time)
I totally believe this. I know people were praising the 3.6 MoE, but I preferred 3.5 dense even before.
The MoE can follow instructions that already correctly identify what should be done, but it can’t come up with a good plan on its own for a non-trivial task. It’s not smart enough. It tries to solve things at the wrong layer (for example, I had a JSON escaping issue from bash → docker → ‘bash -c’ causing vLLM’s json.loads to give an error; it spent the first 10 minutes reading the source code of vLLM inside the 3rd party docker container and trying to patch based on that).
This is infuriating to use interactively. I’d rather wait on a slower but smarter dense model that requires fewer guidance interactions from me. And as seen in your benchmark, it actually finishes the task faster because it makes fewer mistakes.
I’m surprised the benchmark scores between MoE and dense are that close in your test tbh. It’s just not what I observed in reality.
One benchmark probably isn’t a great indicator tbh, I really need to add some more. I will review the logs at some point too, but I don’t want to review loads of logs for loads of models, so I’m hoping that I can use the numbers to narrow things down to a smaller selection that are worth reviewing (at which point I’d like to play around with the different flags/spec-decoding options, and maybe try to run the benchmarks through different harnesses to see how different tools affect things).
So you don’t consider Nvidia a first party AI lab?
“They aren’t making them” is a logical fallacy when Nvidia will do it for the big players. Major releases get NVFP4 quants done for them, released openly, and embedded into NIMs. Why would they?
NVFP4 is a 4-bit quant and, like all such formats, requires calibration data. Why? Because there is some loss. Calibrate your own if concerned. Or use FP8. But it isn’t “bad”; it’s a tool.
Tailored harnesses definitely pump up the score. Someone wrote a dedicated harness aimed at small local models (i.e. prompt, smart auto-retry strategy, detecting thinking loops), and the score of 3.5 9B in Aider Polyglot went up from 19% without the harness to 45% with it.
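To give a feel for what "detecting thinking loops" plus an auto-retry could look like, here is a minimal sketch. It is not the harness mentioned above: the repeated-tail heuristic, the function names, the retry nudge text, and the `generate` callable (standing in for whatever sends a prompt to the model) are all hypothetical.

```python
from typing import Callable


def looks_like_loop(text: str, window: int = 40, min_repeats: int = 3) -> bool:
    """Heuristic: True if the last `window` characters occur at least
    `min_repeats` times (non-overlapping) in the text."""
    if len(text) < window * min_repeats:
        return False
    return text.count(text[-window:]) >= min_repeats


def generate_with_retry(generate: Callable[[str], str], prompt: str,
                        max_attempts: int = 3) -> str:
    """Call `generate` again, with a nudge appended, if the output loops."""
    output = ""
    for _ in range(max_attempts):
        output = generate(prompt)
        if not looks_like_loop(output):
            return output
        prompt += "\nYour previous attempt repeated itself. Answer concisely."
    return output  # all attempts looped; return the last one
```

A real harness would likely detect loops while tokens stream in so it can cut generation short, rather than checking only after the full output arrives.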
With Qwen3.6-35B-A3B UD-Q4_K_M on TerminalBench, he got 40%, which is the score Sonnet 4.5 got when used with Claude Code, which is the most likely use-case for developers doing agentic tasks.
Nice, this is the sort of thing I keep thinking about. Like there are many different possible implementations of an edit tool for modifying files. It would be fun to run evals across lots of variations and see how they affect things, but first I need to figure out a good model to start with. While the results might differ between models, trying every combination would take forever, so picking a good base model to start with and then iterating on tools/harness would probably be more reasonable.
My point against your statement is that Unsloth does create NVFP4 quants, and they almost inarguably do more quantization than anyone, even including the actual model providers. What the Spark community is facing is that very few people target our hardware, NVFP4 or otherwise. This is sadly especially true for the largest models. It’s why I’ve been hoping we can get improved GGUF support, since that’s the format almost everyone is releasing in. At least we could use those until someone generates a more optimal quant format for the given model that we can actually use.
This is one of the enterprise vendors for Inference as a Service. NVFP4 is very popular for the Blackwell generation, as it gives a lot more performance from limited corporate hardware (clouds are a different story). But the Spark is sm12x, not sm100 (actual Blackwell).
Overall, the best NVFP4 quants are done the same way as REAP: tuned to the message history for that particular use case.
RedHatAI also seems reputable, and they also produce NVFP4 quants. On the amateur side, it’s relatively easy to quantize to NVFP4, provided you have all the data you need. There are a lot that work just fine.
The issue is the amount of RAM you need in order to quantize the largest models. Trying to quantize something like GLM or Kimi can take 2TB of RAM. I tried doing it with 192GB and disk swap and even Claude and Gemini gave up after trying everything they could think of, saying it wasn’t going to happen and to just rent a RunPod with 2TB so that everything stayed in RAM.
PrismaQuant handles quantization incrementally. Try it! I spent $100 a few weeks ago renting an H100 trying to quantize Qwen3.5 122B fully to NVFP4 before doing this PrismaQuant work. If you just want pure quantization you can ask Claude or Codex to disable all the other bits, but those parts are great as well.
I can do massive models locally with 80-90 GB of RAM. The more RAM you have the better. This is on a Spark, btw. I’ve quantized the full MiniMax 2.7 with memory to spare.
Also, re NVFP4: first-party labs probably aren’t doing it because it’s so new. Nvidia is doing it for Nemotron, but the Chinese labs aren’t doing it because they literally can’t get Blackwell, and even if they could, the addressable market is too small. DeepSeek released V4 yesterday with an MXFP4 expert layer, since that format is a bit more standard and Blackwell compatible, but it is also regarded as inferior to NVFP4.
Is NVFP4 perfect and all there yet? No, but it’s also less than a year old. This community has been integral in getting support for it into vLLM, CUTLASS, FlashInfer, etc., and will continue to be so.