Introducing PrismaQuant

Hi all,

Some of you have seen my earlier posts about getting NVFP4 working on the Spark, and the tribulations that entailed. When I first got the Spark I felt disheartened by the performance — especially watching Int4 AutoRound models clobber NVFP4 runs (when NVFP4 would even load).

Here’s the thing: NVFP4 isn’t the problem. It’s a strictly better format than Int4 for LLM weight distributions: the microscaling (per-block) scale factor captures outlier channels that uniform Int mangles. MXFP4 / MXFP6 / MXFP8 are all in the same family. Where NVFP4 has been losing is that most quantized NVFP4 checkpoints leave huge swaths of the model untouched in BF16, out of caution rather than measurement. AutoRound quantizes everything to Int4 uniformly. If you compare “everything at 4-bit Int” against “one 4-bit family + a hand-picked BF16 list,” the Int4 checkpoint wins on size at roughly tied quality. That’s not a format issue. That’s an allocation issue.

I asked myself: can we measure which layers are actually sensitive to NVFP4 and put only those in BF16 / MXFP8, while everything else goes 4-bit?

Turns out yes — and it’s called PrismQuant. Every layer refracts into a different format based on its own sensitivity.

How it works, briefly: PrismQuant runs a short Fisher-information probe over your model, measures per-(Linear, format) quantization error, and feeds both into a multi-choice knapsack allocator that picks each Linear’s format under a total-bit budget. The output is a recipe like “127 Linears in NVFP4, 26 in MXFP8, 252 in BF16”, with no hand-picked ignore lists, driven entirely by measurement. It even emits a Pareto curve across target bit budgets so you can see the knee explicitly.
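
To make the allocation step concrete, here is a minimal sketch of a multiple-choice knapsack solved by Lagrangian relaxation: put a price on each stored bit, let every layer pick its cheapest format at that price, and bisect the price until the average bits per weight fits the budget. This is not PrismQuant's actual allocator. The layer names, sizes, and cost numbers are made up, and the bits-per-weight figures are assumed approximations that include block scales.

```python
from dataclasses import dataclass

# Approximate storage cost per weight, block scales included (assumed values):
# NVFP4 = 4 bits + one 8-bit scale per 16 weights,
# MXFP8 = 8 bits + one 8-bit scale per 32 weights.
BITS = {"NVFP4": 4.5, "MXFP8": 8.25, "BF16": 16.0}


@dataclass
class Layer:
    name: str
    n_params: int
    cost: dict  # format -> predicted damage (placeholder numbers, not measurements)


def pick(layers, lam):
    """Each layer independently minimises cost + lam * (bits it would occupy)."""
    return {
        l.name: min(BITS, key=lambda f: l.cost[f] + lam * BITS[f] * l.n_params)
        for l in layers
    }


def avg_bits(layers, choice):
    """Average stored bits per weight for a given format assignment."""
    total = sum(l.n_params for l in layers)
    return sum(BITS[choice[l.name]] * l.n_params for l in layers) / total


def allocate(layers, target_bpp, iters=60):
    """Bisect the per-bit price until the average bits per weight fits the budget."""
    if target_bpp < min(BITS.values()):
        raise ValueError("budget is below the smallest available format")
    lo, hi = 0.0, 1.0
    while avg_bits(layers, pick(layers, hi)) > target_bpp:
        hi *= 2.0  # raise the price until the cheapest mix fits
    for _ in range(iters):
        mid = (lo + hi) / 2
        if avg_bits(layers, pick(layers, mid)) > target_bpp:
            lo = mid
        else:
            hi = mid
    return pick(layers, hi)  # hi always satisfies the budget


# Hypothetical layers: one sensitive projection, one robust MLP, one output head.
layers = [
    Layer("attn.q_proj", 1_000_000, {"NVFP4": 0.90, "MXFP8": 0.05, "BF16": 0.0}),
    Layer("mlp.up_proj", 4_000_000, {"NVFP4": 0.02, "MXFP8": 0.001, "BF16": 0.0}),
    Layer("lm_head", 2_000_000, {"NVFP4": 0.30, "MXFP8": 0.02, "BF16": 0.0}),
]
recipe = allocate(layers, target_bpp=6.2)
print(recipe, round(avg_bits(layers, recipe), 3))
```

With these toy numbers the large, robust MLP stays in NVFP4 while the two sensitive layers are promoted to MXFP8. Re-running `allocate` over a range of `target_bpp` values gives a rough Pareto curve. Lagrangian relaxation only reaches mixes on the convex hull of the trade-off, so an exact solver can sometimes use the budget more fully.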

The important part: the output is a standard compressed-tensors checkpoint. Zero vLLM patches. Zero custom kernels. Download from HF, vllm serve, done. It ships with the model’s MTP heads already quantized and speculative decoding operational (--speculative-config with method=mtp works out of the box).

Concrete result, Qwen3.6-35B-A3B at 4.75 bpp (zero-shot lm-eval, DGX Spark GB10, vLLM 0.19.2):

Checkpoint                     Size on disk   Mean accuracy vs BF16
BF16 source                    70 GB          baseline
PrismQuant 4.75 bpp (ours)     22 GB          -0.56 pp
RedHatAI uniform NVFP4         24 GB          -2.21 pp
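
As a sanity check on the sizes, a back-of-envelope sketch converts bits per parameter into checkpoint size. The 35e9 parameter count is an assumption read off the model name, not a measured value, and the helper name is made up for illustration.

```python
def checkpoint_gb(n_params, bits_per_param):
    """Size in GB of n_params weights stored at bits_per_param bits each."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 35e9  # assumed from the "35B" in the model name

for label, bpp in [("BF16", 16.0), ("4.75 bpp", 4.75)]:
    print(f"{label}: {checkpoint_gb(N_PARAMS, bpp):.1f} GB")
```

The BF16 figure lands on the 70 GB baseline. The 4.75 bpp figure comes out a little under the reported 22 GB. The post does not break down the difference.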

At smaller disk than the uniform NVFP4 artifact, PrismQuant wins 8 of 9 commonsense-zero-shot metrics (arc_easy/challenge, piqa, hellaswag, winogrande — sign test p < 0.02). The arc_easy gap is 3.11 pp at 2.6σ, statistically significant. Reasoning-heavy benchmarks (MMLU / GSM8K / HumanEval) are still pending.
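
The sign-test figure can be reproduced with a few lines of arithmetic. This sketch assumes the test treats each of the 9 metrics as a fair coin flip under the null hypothesis. The post does not say whether the test was one-sided or two-sided, so both are computed.

```python
from math import comb


def sign_test_one_sided(wins, n):
    """P(at least `wins` successes in n fair coin flips)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2**n


p_one = sign_test_one_sided(8, 9)  # (9 + 1) / 512
p_two = min(1.0, 2 * p_one)
print(f"one-sided p = {p_one:.4f}, two-sided p = {p_two:.4f}")
# prints: one-sided p = 0.0195, two-sided p = 0.0391
```

The quoted p < 0.02 matches the one-sided value. The two-sided value is about 0.04.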

Onboarding a new model is short. Gemma 4 took a ~70-line profile file. The framework auto-derives the fused-sibling structure and name remaps from vLLM’s own packed_modules_mapping and hf_to_vllm_mapper attributes, so every time vLLM adds a new model, PrismQuant follows for free. There’s a validator (python -m prismquant.model_profiles.validate --model /path) that runs 7 consistency checks before you waste a probe run.

Tested on: Qwen3.5-27B, Qwen3.6-35B-A3B (multimodal MoE, with MTP), Gemma 4 (in progress). Next targets: 3-bit, MXFP6, and MiniMax M2.7 on a single Spark.

Try it:

📦 Code: RobTand/prismaquant on GitHub

🤗 Pre-built Qwen3.6-35B-A3B @ 4.75 bpp: rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm on Hugging Face

This is a v1. It’s solid, but there are probably still improvements to be made. I’m especially excited for future 3-bit support (which could make MiniMax M2.7 possible on a single Spark!), as well as targeting GPUs with much more limited memory than the Spark. Imagine being able to quantize a model to fit a 24GB 3090 and knowing each weight is at the highest quality the hardware’s memory budget allows.

Thanks for your attention –

Rob

Thank you – Thank you – Thank you!!!

Step by step recipes so we can follow along at home – I am in.

Think of your target audience (me) as having the knowhow of a friendly Labrador – it makes it more inclusive for those who are new to all this and afraid to ask.

This is really neat tech! However, with the whole model heterogeneous, how does the backend work and how does throughput realistically look?

Today in vLLM we have 7+ backends, a few attention options, and MoE may operate differently than dense. That’s a thicket for support, and it can be frustrating enough just to get one quant dtype optimized.

It’s quite fast. Try it out! Nothing in vLLM prevents different layers/Linears from using different precisions. Right now we’re aligned to what Blackwell supports (NVFP4, MXFP4, MXFP8, FP8, BF16), but eventually we may want to support 3-bit quantization, which may require a custom kernel. Today, though, it just works in vLLM with existing well-supported types.

I’ll grab it and experiment alongside the int4fp8 hybrid regarding throughput. Thanks for your work!

This is awesome! Thanks for your work on this!!

How large a model can I convert on my spark? Like can I increase swap to work on something larger, or does it need to all fit into physical memory?

Swap is bad lol. Don’t use swap. I had no trouble doing Qwen3.6 in bf16 – so ~70 gigs? YMMV, haven’t tried anything larger like Qwen3.5-122B

I ended up using swap with Intel AutoRound to overcome OOM during loading, which worked. It thrashed a bit at first, but once it got going it settled down.

That is my goal ;)

I do one linear/shard at a time, so it should be possible. I am literally trying it now. Race ya!

there are some issues. I’m going to fix them and then upload to HF

If you are working with AutoRound, set the low_gpu_mem_usage flag to true. I believe that should allow almost anything to be quantized on the Spark, at the cost of overall efficiency of course.

I built something to stream weights for analysis on the Spark, so we should have support for 122B soon. I’ll also be uploading something to HF in the next day or so.

I wonder if a dual-node cluster would be sufficient to build Qwen 3.5 397b. That model would greatly benefit from your work. I’ll follow the development and am looking forward to the first few success cases.

Benchmarks of models optimized for PrismQuant vs. NVFP4/AutoRound/FP8 counterparts would be interesting to understand the implications beyond accuracy and model weights.

Very keen to give your 122B a try, put it through its paces, and provide feedback.

I have a 2x Gx10 cluster and would love to try this on the Minimax M2.7, think it’s feasible?

WOW! This is awesome!

I do. It’s next up on my list as soon as I finish Qwen3.5-122B (it’s nearly done; just testing now). I’ve designed everything to run streaming so that it doesn’t need more memory than the box has to create a quantized image.

Wow - Cool.

A spark-vllm-docker recipe based on the Hugging Face description (rdtand/Qwen3.6-35B-A3B-PrismQuant-4.75bit-vllm).

This is my first time writing a recipe, so if anything is incorrect or there is room for improvement, please feel free to let me know :) :)

model: rdtand/Qwen3.6-35B-A3B-PrismQuant-4.75bit-vllm

container: vllm-node-tf5

mods:
  - mods/fix-qwen3-coder-next
  - mods/fix-qwen3.5-chat-template

defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.7
  max_model_len: 262144
  max_num_batched_tokens: 16384
  speculative_config: '\{\"method\":\"mtp\",\"num_speculative_tokens\":3\}'

env: 
  FLASHINFER_DISABLE_VERSION_CHECK: 1

command: |
  vllm serve rdtand/Qwen3.6-35B-A3B-PrismQuant-4.75bit-vllm \
    --host {host} \
    --port {port} \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --attention-backend flashinfer \
    --enable-prefix-caching \
    --kv-cache-dtype fp8 \
    --trust-remote-code \
    --quantization compressed-tensors \
    -tp {tensor_parallel} \
    --speculative-config={speculative_config}
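
A note on the backslashes in speculative_config: they look like escaping debris, but if the recipe runner pastes the value unquoted into a shell command (an assumption about spark-vllm-docker, not verified here), they are what keeps the braces and double quotes intact. The sketch below uses Python's shlex in POSIX mode as a stand-in for the shell's word splitting and checks that the flag arrives as valid JSON.

```python
import json
import shlex

# The value exactly as it appears in the recipe's defaults block.
speculative_config = r'\{\"method\":\"mtp\",\"num_speculative_tokens\":3\}'

# Substitute it into the last line of the command template.
fragment = "--speculative-config={speculative_config}".format(
    speculative_config=speculative_config
)

# POSIX-style splitting drops each backslash and keeps the escaped character.
(arg,) = shlex.split(fragment)
flag, _, payload = arg.partition("=")
config = json.loads(payload)
print(flag, config)
```

If the runner instead passes arguments without going through a shell, the backslashes would reach vLLM literally and the JSON would fail to parse. In that case the plain form {"method":"mtp","num_speculative_tokens":3} is the one to use.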