Hi all,
First of all, let me say I’m absolutely blown away by the response to PrismaQuant. Thank you all. So many people on this board have donated their energy to running benchmarks or providing feedback.
To date, PrismaQuant models have been downloaded over 80k times from HuggingFace, and the direct feedback I’ve gotten on them has been almost universally positive. This is legitimately an underserved area and I’m proud to lend my time, energy, and tokens to it. I open Twitter and Reddit and see people talking about these models without even going to look for them. It’s something I’ve never experienced before.
Secondly, thanks to NVIDIA for creating this terrific platform. We bellyache about the Spark on this forum regularly, and certainly, it’s not perfect, but there’s no arguing that it’s facilitated a massive amount of personal and professional growth for me and has been worth every penny. I’d love a second one, but maybe I’ll wait for a Vera-generation hardware refresh.
PrismaQuant V2 is here, and it leverages a new algorithm I’m calling PrismaScout.
It’s already on GitHub and I’m making models with it. Mathematically, it makes the original PrismaQuant look primitive, but I’m not sure that makes it better yet. I can’t wait for you to try it and see. It upgrades how we determine sensitivity by tweaking weights and tracing those impacts all the way through the model, then doing some very cool optimization to figure out where quantization destroys the least value. I’ve leaned on some of the best and latest literature in the field, much of it preprint, and added some new mechanisms of my own. This was both a harrowing piece of theoretical work and an engineering challenge to make it fast enough to be workable on a single Spark.
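For readers who want a feel for what "tweak the weights and trace the impact through the model" can mean, here is a minimal sketch. It is not PrismaScout: it swaps in the simplest possible stand-in, which is to fake-quantize one layer at a time in a tiny random MLP and measure how far the final output moves. Every name here (`fake_quantize`, `layer_sensitivity`, the toy network) is hypothetical and chosen only for illustration.

```python
import random

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def forward(layers, x):
    """Run a toy MLP: ReLU after every layer except the last."""
    for i, w in enumerate(layers):
        x = matvec(w, x)
        if i < len(layers) - 1:
            x = [max(0.0, a) for a in x]
    return x

def fake_quantize(w, bits):
    """Round weights onto a symmetric uniform grid (illustrative, not NVFP4)."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(a) for row in w for a in row) / levels
    if scale == 0.0:
        return [list(row) for row in w]
    return [[round(a / scale) * scale for a in row] for row in w]

def layer_sensitivity(layers, inputs, bits=4):
    """Mean squared output error when only one layer at a time is quantized."""
    baseline = [forward(layers, x) for x in inputs]
    scores = []
    for i in range(len(layers)):
        perturbed = list(layers)
        perturbed[i] = fake_quantize(layers[i], bits)
        err, count = 0.0, 0
        for x, ref in zip(inputs, baseline):
            out = forward(perturbed, x)
            err += sum((a - b) ** 2 for a, b in zip(out, ref))
            count += len(ref)
        scores.append(err / count)
    return scores

# Toy data: three random 8x8 layers and eight random inputs (all made up).
rng = random.Random(0)
dim = 8
layers = [[[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(dim)]
          for _ in range(3)]
inputs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(8)]

scores_4bit = layer_sensitivity(layers, inputs, bits=4)
scores_16bit = layer_sensitivity(layers, inputs, bits=16)
for i, s in enumerate(scores_4bit):
    print(f"layer {i}: output MSE when quantized to 4 bits = {s:.6f}")
```

A layer with a higher score is one where lower precision hurts the output more, so an optimizer would tend to spend more bits there. Real methods measure this on calibration data with a task-relevant loss rather than raw output error on random inputs.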
I’ve created a version of Qwen3.6-27B that is about 11% smaller and performs about 3-5% better. It’s at 5.3 bits, on average. Mathematically, this model is the best balance of performance and accuracy when the two are weighted equally in the tradeoff (the “kneedle point” on the Pareto curve). It’s shipping here: rdtand/Qwen3.6-27B-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm · Hugging Face
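To make the "kneedle point" idea concrete, here is a small sketch of a simplified Kneedle-style knee finder: normalize both axes to [0, 1] and pick the point that sits farthest above the diagonal. This assumes an increasing, concave curve (quality rising with average bits), and the numbers are made up for illustration; they are not measurements from any PrismaQuant model. The function name `knee_index` is hypothetical.

```python
def knee_index(xs, ys):
    """Index of the knee of an increasing, concave curve.

    Simplified Kneedle: rescale both axes to [0, 1], then return the
    point with the largest gap above the diagonal y = x.
    """
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need at least three (x, y) points")
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    if x_max == x_min or y_max == y_min:
        raise ValueError("curve must vary along both axes")
    best_i, best_gap = 0, float("-inf")
    for i, (x, y) in enumerate(zip(xs, ys)):
        gap = (y - y_min) / (y_max - y_min) - (x - x_min) / (x_max - x_min)
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i

# Hypothetical Pareto points: average bits vs. a quality score in [0, 1].
avg_bits = [4.0, 4.5, 5.0, 5.5, 6.0, 8.0, 16.0]
quality = [0.80, 0.90, 0.96, 0.98, 0.985, 0.995, 1.00]

knee = knee_index(avg_bits, quality)
print(f"knee at {avg_bits[knee]} average bits (quality {quality[knee]})")
```

Past the knee, each extra bit buys less quality than it did before it, which is why it is a natural default when size and accuracy are weighted equally. The full Kneedle algorithm also smooths the curve and applies a sensitivity threshold, which this sketch omits.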
A paper describing the new techniques we used, as well as our citations of existing ones, is forthcoming.
Excited to hear your feedback. Benchmarks (GSM8K, etc.) are still running. I also hope to ship a non-Blackwell-specific version soon (MXFP4/MXFP8/BF16); this artifact is NVFP4/BF16 only (the optimizer elected against any MXFP8 legs).
Thank you for your continued support and feedback. Please let me know if you have any requests!
Rob


