Hi all,
Some of you have seen my earlier posts about getting NVFP4 working on the Spark, and the tribulations that entailed. When I first got the Spark I felt disheartened by the performance — especially watching Int4 AutoRound models clobber NVFP4 runs (when NVFP4 would even load).
Here’s the thing: NVFP4 isn’t the problem. It’s a strictly better format than Int4 for LLM weight distributions, because its per-block (microscaling) scale factors capture outlier channels that uniform Int4 mangles. MXFP4 / MXFP6 / MXFP8 are all in the same family. Where NVFP4 has been losing is that most quantized NVFP4 checkpoints leave huge swaths of the model untouched in BF16, out of caution rather than measurement. AutoRound quantizes everything to Int4 uniformly. If you compare “everything at 4-bit Int” against “one 4-bit family + a hand-picked BF16 list,” Int4 wins on size at roughly tied quality. That’s not a format issue. That’s an allocation issue.
I asked myself: can we measure which layers are actually sensitive to NVFP4 and put only those in BF16 / MXFP8, while everything else goes 4-bit?
Turns out yes — and it’s called PrismQuant. Every layer refracts into a different format based on its own sensitivity.
How it works, briefly: PrismQuant runs a short Fisher-information probe over your model, measures per-(Linear, format) quantization error, and feeds both into a multi-choice knapsack allocator that picks each Linear’s format under a total-bit budget. The output is a recipe like “127 Linears in NVFP4, 26 in MXFP8, 252 in BF16”: no hand-picked ignore lists, driven entirely by measurement. It even emits a Pareto curve across target bit budgets so you can see the knee explicitly.
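For anyone who wants a feel for the allocation step, here is a toy sketch of a multi-choice knapsack: each layer gets exactly one format, and the total error is minimized under a bit budget. This is not PrismQuant's actual code. The layer names, parameter counts, error values, and the `allocate` helper are all made up for illustration, and the bits-per-weight figures ignore scale-factor overhead.

```python
from typing import Dict, List, Tuple

# Nominal bits per weight; block-scale overhead is ignored (simplifying assumption).
FORMAT_BITS: Dict[str, int] = {"NVFP4": 4, "MXFP8": 8, "BF16": 16}

# (layer name, parameter count, {format: sensitivity-weighted error}); values are invented.
Layer = Tuple[str, int, Dict[str, float]]


def allocate(layers: List[Layer], budget_bits: int) -> Tuple[float, Dict[str, str]]:
    """Pick exactly one format per layer, minimizing total error within budget_bits."""
    # best[used_bits] = (lowest total error reaching that cost, formats chosen so far)
    best: Dict[int, Tuple[float, List[str]]] = {0: (0.0, [])}
    for _name, n_params, errors in layers:
        nxt: Dict[int, Tuple[float, List[str]]] = {}
        for used, (err, picks) in best.items():
            for fmt, fmt_err in errors.items():
                cost = used + n_params * FORMAT_BITS[fmt]
                if cost > budget_bits:
                    continue  # this choice would exceed the budget
                total = err + fmt_err
                if cost not in nxt or total < nxt[cost][0]:
                    nxt[cost] = (total, picks + [fmt])
        best = nxt
    if not best:
        raise ValueError("budget too small even for the cheapest formats")
    total_err, picks = min(best.values(), key=lambda state: state[0])
    return total_err, {layer[0]: fmt for layer, fmt in zip(layers, picks)}


toy_layers: List[Layer] = [
    ("attn.q_proj", 4, {"NVFP4": 0.90, "MXFP8": 0.10, "BF16": 0.0}),
    ("mlp.up_proj", 8, {"NVFP4": 0.05, "MXFP8": 0.01, "BF16": 0.0}),
    ("mlp.down_proj", 8, {"NVFP4": 0.30, "MXFP8": 0.04, "BF16": 0.0}),
]

# Sweep average bits per parameter to trace a small error-vs-size curve.
total_params = sum(n for _, n, _ in toy_layers)
for avg_bits in (4, 6, 8, 16):
    err, recipe = allocate(toy_layers, avg_bits * total_params)
    print(avg_bits, round(err, 3), recipe)
```

In this toy, the sensitive `attn.q_proj` is the first layer promoted out of 4-bit as the budget grows, which is the behavior the allocator is meant to produce. The exact dynamic program here only suits small examples; a real model with hundreds of Linears would need a coarser or approximate solver.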
The important part: the output is a standard compressed-tensors checkpoint. Zero vLLM patches. Zero custom kernels. Download from HF, vllm serve, done. It ships with the model’s MTP heads already quantized and speculative decoding operational (--speculative-config method=mtp works out of the box).
Concrete result, Qwen3.6-35B-A3B at 4.75 bpp (zero-shot lm-eval, DGX Spark GB10, vLLM 0.19.2):
BF16 source: 70 GB (baseline)
PrismQuant 4.75 bpp (ours): 22 GB, -0.56 pp mean vs BF16
RedHatAI uniform NVFP4: 24 GB, -2.21 pp mean vs BF16
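A rough back-of-the-envelope check on those sizes, assuming "bpp" means bits per parameter and that the 70 GB BF16 checkpoint is essentially all 16-bit weights (both are assumptions, and `estimated_size_gb` is a hypothetical helper):

```python
def estimated_size_gb(n_params: float, bits_per_param: float) -> float:
    """Checkpoint size in decimal GB for a given average bit width."""
    return n_params * bits_per_param / 8 / 1e9


# 70 GB at 16 bits per parameter implies roughly 35 billion parameters.
n_params = 70e9 * 8 / 16

budgeted_gb = estimated_size_gb(n_params, 4.75)
print(f"{n_params / 1e9:.0f}B params at 4.75 bpp is about {budgeted_gb:.1f} GB")

# Fraction of the BF16 footprint for the two quantized artifacts listed above.
for name, size_gb in (("PrismQuant 4.75 bpp", 22), ("uniform NVFP4", 24)):
    print(f"{name}: {size_gb / 70:.0%} of BF16")
```

The estimate lands a little under the 22 GB on disk. The gap would be consistent with scale factors and tensors that sit outside the bit budget, though that is a guess and not something measured here.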
At a smaller on-disk size than the uniform NVFP4 artifact, PrismQuant wins 8 of 9 commonsense zero-shot metrics (arc_easy/challenge, piqa, hellaswag, winogrande; sign test p < 0.02). The arc_easy gap is 3.11 pp at 2.6σ, statistically significant. Reasoning-heavy benchmarks (MMLU / GSM8K / HumanEval) are still pending.
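The quoted statistics can be reproduced with a few lines of standard-library Python. This sketch assumes the sign test is one-sided with no ties, and that the 2.6σ figure can be read as a z-score under a normal approximation; the helper names are hypothetical.

```python
import math


def sign_test_one_sided(wins: int, n: int) -> float:
    """Exact P(X >= wins) for X ~ Binomial(n, 0.5)."""
    return sum(math.comb(n, k) for k in range(wins, n + 1)) / 2 ** n


def two_sided_p_from_z(z: float) -> float:
    """Two-sided tail probability of a standard normal at |z|."""
    return math.erfc(abs(z) / math.sqrt(2))


p_sign = sign_test_one_sided(8, 9)   # 10 / 512
p_arc_easy = two_sided_p_from_z(2.6)

print(f"sign test, 8 of 9 wins: p = {p_sign:.4f}")
print(f"arc_easy gap at 2.6 sigma: p = {p_arc_easy:.4f}")
```

The one-sided sign test gives 10/512, which is just under 0.02. A two-sided version would double that, so the p < 0.02 claim holds only in the one-sided reading.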
Onboarding a new model is short. Gemma 4 took a ~70-line profile file. The framework auto-derives the fused-sibling structure and name remaps from vLLM’s own packed_modules_mapping and hf_to_vllm_mapper attributes, so every time vLLM adds a new model, PrismQuant follows for free. There’s a validator (python -m prismquant.model_profiles.validate --model /path) that runs 7 consistency checks before you waste a probe run.
Tested on: Qwen3.5-27B, Qwen3.6-35B-A3B (multimodal MoE, with MTP), Gemma 4 (in progress). Next targets: 3-bit, MXFP6, and MiniMax M2.7 on a single Spark.
Try it:
🤗 Pre-built Qwen3.6-35B-A3B @ 4.75 bpp: rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm · Hugging Face
This is a v1: it’s solid, but there are probably still improvements to be made. I’m especially excited about future 3-bit support (which could make MiniMax M2.7 possible on a single Spark!), as well as targeting GPUs with much less memory than the Spark. Imagine being able to quantize a model to fit a 24GB 3090 and knowing each weight is at the highest quality the allocator can find for that hardware.
Thanks for your attention –
Rob