From the guys who brought us DFlash, this looks kinda interesting. They’ve released a bunch of models on HF. I’m downloading the Qwen 3.5 27B one and will check how it compares against FP8.
How were the results on them? I just found them and was interested too.
Sorry, forgot to update - seemed pretty close to FP8 on accuracy, but unfortunately it did not work with TP=2 for me and also seemed to lack DFlash support, so I went back to running FP8 instead.
If it’s not even as good as FP8, you should probably have a look at Introducing PrismaScout -- PrismaQuant v2!
It’s important to note that ParoQuant is kind of an orthogonal optimization strategy. PrismaQuant optimizes formats based on sensitivity, but we also do things in the same vein as ParoQuant: GPTQ, HALO, (QuaRot), outlier sweeps, and a closed-form variant of Intel AutoRound.
If ParoQuant ends up being great, we can easily integrate a variant of it! Feel free to make a fork ;) IIRC, ParoQuant uses INT4, so we wouldn’t want to leverage it exactly, but some of the strategies can be adopted.
Feel free to fork if you’re interested ;)
Given that its results with the Qwen 3.6 27B model weren’t as good as PrismaQuant’s, I haven’t bothered looking at it any further. Performance-wise it also seemed to be lagging behind FP8 / BF16 speeds, but that might have been due to the lack of TP > 1 and DFlash support.
I have actually made a fork, and may try to add some features from various other places I have found, with the aid of AI. I hope to get round to starting in the next few days, but unfortunately I can’t guarantee this. My plan is here if you are interested: prismaquant/docs/cross_repo_quantization_ideas.md at main · Chargeuk/prismaquant · GitHub
It’s not going to be a great direct port because it’s locked to INT4 today. If you do spend some time with it, replace the HALO path.
I tried ParoQuant for Qwen 3.6 27B and I somehow get about 2x slower prefill speed than on PrismaScout or PrismaQuant.
llama-benchy Averages
| Metric | PrismaSCOUT | PARO | Delta |
|---|---|---|---|
| Runtime | 0:04:50 | 0:07:20 | +2:30 slower |
| Est latency | 150.1ms | 141.9ms | slightly better |
| Avg pp t/s | 2,163 | 997 | -54% |
| Avg tg t/s | 61.7 | 58.9 | -5% |
| Avg TTFT | 7,344ms | 12,925ms | +76% worse |
| Avg Total | 11,185ms | 16,844ms | +51% worse |
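For anyone who wants to re-derive the Delta column, here is a minimal sketch of the arithmetic. The helper name `pct_delta` is made up for illustration; the numbers are copied from the table above, with PrismaSCOUT taken as the baseline.

```python
def pct_delta(baseline: float, other: float) -> float:
    """Relative change of `other` versus `baseline`, in percent."""
    return (other - baseline) / baseline * 100.0

# (PrismaSCOUT, PARO) pairs copied from the llama-benchy table.
rows = {
    "Avg pp t/s": (2163, 997),
    "Avg tg t/s": (61.7, 58.9),
    "Avg TTFT ms": (7344, 12925),
    "Avg Total ms": (11185, 16844),
}

for name, (scout, paro) in rows.items():
    print(f"{name}: {pct_delta(scout, paro):+.1f}%")

# Prefill slowdown factor: PrismaSCOUT throughput divided by PARO throughput.
slowdown = 2163 / 997
print(f"prefill slowdown: {slowdown:.2f}x")
```

The prefill ratio comes out a little above 2x, which matches the "about 2x slower" impression.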
Albond Decode
| Metric | PrismaSCOUT | PARO | Delta |
|---|---|---|---|
| Q&A | 37.2 | 38.6 avg | +1.4 |
| Code | 34.3 | 46.5 avg | +12.2 |
| JSON | 53.6 | 52.5 avg | -1.1 |
| Math | 51.6 | 56.5 avg | +4.9 |
| LongCode | 43.5 | 48.9 avg | +5.4 |
| Avg all | 44.0 | 48.6 | +4.6 |
| Avg excl Math | 42.2 | 46.6 | +4.4 |
Does anybody have an idea why this might be happening?
I did some work here and ultimately came to the conclusion that rotation-based schemes are NOT useful for NVFP4. I found a paper that basically validated what I determined experimentally. The larger the block size, the more useful rotation methods are, because larger blocks capture more outliers. Small blocks, like the 16-element ones used by NVFP4, are already so small that their outliers are mostly confined to one set of elements.
I think for INT4 + MXFP4 it’s probably a reasonable strategy, although it does require runtime kernel support in vLLM, which greatly hurts its feasibility unless vLLM mainlines the kernel. Methods such as ParoQuant are difficult to support because they require run-time un-rotation of the data. That adds complexity and ongoing maintenance, and you have to get people to ship your kernel. It’s… a lift.
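The block-size point can be illustrated with a toy experiment. This is only a sketch under loud assumptions: it uses a plain symmetric integer grid with absmax scaling per block (not NVFP4's actual FP4 format or scale encoding), synthetic Gaussian data, and hand-injected outliers. It shows how many neighbours an outlier drags down at each block size; it does not model rotation itself.

```python
import numpy as np

def blockwise_quant_error(x, block_size, levels=7):
    """MSE of symmetric absmax quantization applied independently per block.

    `levels=7` mimics a signed 4-bit integer grid; this is an assumption
    for illustration, not the NVFP4 value grid.
    """
    blocks = x.reshape(-1, block_size)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / levels
    scale[scale == 0] = 1.0  # guard against all-zero blocks
    q = np.clip(np.round(blocks / scale), -levels, levels)
    return float(np.mean((q * scale - blocks) ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
x[::256] = 40.0  # one large outlier every 256 elements

err_16 = blockwise_quant_error(x, 16)
err_128 = blockwise_quant_error(x, 128)
print(f"block=16  MSE: {err_16:.4f}")
print(f"block=128 MSE: {err_128:.4f}")
```

With 16-element blocks each outlier inflates the scale for only 15 neighbours, while with 128-element blocks it inflates it for 127, so there is much less damage left for a rotation to repair at the small block size.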
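To make the runtime cost concrete, here is a minimal sketch of why a rotated weight needs a matching transform at inference time. It uses a generic random orthogonal matrix, which is an assumption for illustration and not ParoQuant's actual transform, and it skips quantization entirely. Whether the extra step is applied to the activations or to the output depends on the scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 64, 32

x = rng.standard_normal((4, d_in))      # activations
W = rng.standard_normal((d_in, d_out))  # original weight

# Random orthogonal matrix via QR, so R @ R.T is the identity.
R, _ = np.linalg.qr(rng.standard_normal((d_in, d_in)))

# Offline: rotate the weight. This is what would be quantized and stored.
W_rot = R.T @ W

y_ref = x @ W              # reference output
y_wrong = x @ W_rot        # rotated weight with no runtime transform
y_rot = (x @ R) @ W_rot    # extra runtime matmul restores the result

print("with runtime rotation:   ", np.allclose(y_ref, y_rot))
print("without runtime rotation:", np.allclose(y_ref, y_wrong))
```

The `x @ R` step is the part that has to live in a kernel inside the serving stack, which is where the maintenance burden comes from.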
Rob
