Introducing PrismaQuant

tenari · April 20, 2026, 12:27am

Hi all,

Some of you have seen my earlier posts about getting NVFP4 working on the Spark, and the tribulations that entailed. When I first got the Spark I felt disheartened by the performance — especially watching Int4 AutoRound models clobber NVFP4 runs (when NVFP4 would even load).

Here’s the thing: NVFP4 isn’t the problem. It’s a strictly better format than Int4 for LLM weight distributions: the microscaling scale factor captures outlier channels that uniform Int mangles. MXFP4 / MXFP6 / MXFP8 are all in the same family. Where NVFP4 has been losing is that most quantized NVFP4 checkpoints leave huge swaths of the model untouched in BF16, out of caution rather than measurement. AutoRound quantizes everything to Int4 uniformly. If you compare “everything at 4-bit Int” against “one 4-bit family + a hand-picked BF16 list,” the Int4 wins on size at roughly tied quality. That’s not a format issue. That’s an allocation issue.

I asked myself: can we measure which layers are actually sensitive to NVFP4 and put only those in BF16 / MXFP8, while everything else goes 4-bit?

Turns out yes, and it’s called PrismaQuant. Every layer refracts into a different format based on its own sensitivity.

How it works, briefly: PrismaQuant runs a short Fisher-information probe over your model, measures per-(Linear, format) quantization error, and feeds both into a multi-choice knapsack allocator that picks each Linear’s format under a total-bit budget. The output is a recipe like “127 Linears in NVFP4, 26 in MXFP8, 252 in BF16”: no hand-picked ignore lists, driven entirely by measurement. It even emits a Pareto curve across target bit budgets so you can see the knee explicitly.
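
To make the allocation step concrete, here is a minimal sketch of a multiple-choice knapsack in plain Python. It is not PrismaQuant's actual code: the `Option` class, the `allocate` function, the layer names, the bit counts, and the loss values are all hypothetical, and the loss numbers simply stand in for whatever sensitivity-weighted quantization error the probe would produce. The sketch uses an exact dynamic program, which is only practical for small toy inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    fmt: str     # format name, e.g. "NVFP4" (illustrative)
    bits: int    # storage cost of the layer in this format (arbitrary units)
    loss: float  # stand-in for sensitivity-weighted quantization error


def allocate(layers: dict[str, list[Option]], budget_bits: int) -> dict[str, str]:
    """Pick exactly one Option per layer, minimizing total loss
    subject to total bits <= budget_bits (exact DP over bits used)."""
    # best maps "bits used so far" -> (total loss, {layer: chosen format})
    best: dict[int, tuple[float, dict[str, str]]] = {0: (0.0, {})}
    for name, options in layers.items():
        nxt: dict[int, tuple[float, dict[str, str]]] = {}
        for used, (loss, picks) in best.items():
            for opt in options:
                total = used + opt.bits
                if total > budget_bits:
                    continue  # this choice would blow the budget
                cand = loss + opt.loss
                if total not in nxt or cand < nxt[total][0]:
                    nxt[total] = (cand, {**picks, name: opt.fmt})
        best = nxt
    if not best:
        raise ValueError("budget too small for any assignment")
    # Among all feasible totals, return the lowest-loss assignment.
    return min(best.values(), key=lambda entry: entry[0])[1]


# Toy input: three layers, three candidate formats each (made-up numbers).
layers = {
    "attn.q_proj": [Option("NVFP4", 4, 0.9), Option("MXFP8", 8, 0.2), Option("BF16", 16, 0.0)],
    "mlp.up_proj": [Option("NVFP4", 4, 0.1), Option("MXFP8", 8, 0.05), Option("BF16", 16, 0.0)],
    "lm_head": [Option("NVFP4", 4, 2.0), Option("MXFP8", 8, 0.5), Option("BF16", 16, 0.0)],
}

plan = allocate(layers, budget_bits=28)
print(plan)
```

With these made-up numbers the most sensitive layer stays in BF16, the least sensitive one drops to 4-bit, and the middle one lands on 8-bit. Re-running `allocate` over a range of budgets and recording the total loss at each would give the kind of Pareto curve described above.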

The important part: the output is a standard compressed-tensors checkpoint. Zero vLLM patches. Zero custom kernels. Download from HF, vllm serve, done. It ships with the model’s MTP heads already quantized and speculative decoding operational (--speculative-config method=mtp works out of the box).

Concrete result, Qwen3.6-35B-A3B at 4.75 bpp (zero-shot lm-eval, DGX Spark GB10, vLLM 0.19.2):

Checkpoint                      Size on disk    Mean vs BF16
BF16 source                     70 GB           baseline
PrismaQuant 4.75 bpp (ours)     22 GB           -0.56 pp
RedHatAI uniform NVFP4          24 GB           -2.21 pp

At a smaller disk footprint than the uniform NVFP4 artifact, PrismaQuant wins 8 of 9 commonsense zero-shot metrics (arc_easy/challenge, piqa, hellaswag, winogrande; sign test p < 0.02). The arc_easy gap is 3.11 pp at 2.6σ, statistically significant. Reasoning-heavy benchmarks (MMLU / GSM8K / HumanEval) are still pending.
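
For anyone who wants to check the sign-test figure, a small sketch follows. It assumes the 9 metrics are treated as independent paired comparisons with no ties and that the quoted p-value is one-sided; the post does not say which variant was used, so `sign_test_p` is only an illustration of the arithmetic.

```python
from math import comb


def sign_test_p(wins: int, n: int) -> float:
    """One-sided sign-test p-value: P(X >= wins) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2**n


p_one_sided = sign_test_p(8, 9)           # 8 wins out of 9 metrics
p_two_sided = min(1.0, 2 * p_one_sided)   # doubled for a two-sided test

print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
```

Under those assumptions the one-sided value is 10/512, which is just under 0.02; the two-sided value would be twice that.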

Onboarding a new model is short. Gemma 4 took a ~70-line profile file. The framework auto-derives the fused-sibling structure and name remaps from vLLM’s own packed_modules_mapping and hf_to_vllm_mapper attributes, so every time vLLM adds a new model, PrismaQuant follows for free. There’s a validator (python -m prismquant.model_profiles.validate --model /path) that runs 7 consistency checks before you waste a probe run.

Tested on: Qwen3.5-27B, Qwen3.6-35B-A3B (multimodal MoE, with MTP), Gemma 4 (in progress). Next targets: 3-bit, MXFP6, and MiniMax M2.7 on a single Spark.

Try it:

📦 Code: RobTand/prismaquant · GitHub

🤗 Pre-built Qwen3.6-35B-A3B @ 4.75 bpp: rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm · Hugging Face

This is a v1: it’s solid, but there are probably still improvements to be made. I’m especially excited for future 3-bit support (which could make MiniMax M2.7 possible on a single Spark!), as well as targeting GPUs with much more limited memory than the Spark. Imagine being able to quantize a model to fit a 24GB 3090 and knowing each weight is at the absolute highest quality possible for the hardware.

Thanks for your attention –

Rob

whpthomas · April 20, 2026, 12:47am

Thank you – Thank you – Thank you!!!

Step by step recipes so we can follow along at home – I am in.

Think of your target audience (me) as having the knowhow of a friendly Labrador – it makes it more inclusive for those who are new to all this and afraid to ask.

jwarner · April 20, 2026, 1:01am

This is really neat tech! However, with the whole model heterogeneous, how does the backend work and how does throughput realistically look?

Today in vLLM we have 7+ backends, a few attention options, and MoE may operate differently than dense. That’s a thicket for support, and it can be frustrating enough just to get one quant dtype optimized.

tenari · April 20, 2026, 1:05am

It’s quite fast. Try it out! Nothing in vLLM prevents different layers/Linears from using different precisions. Right now we’re aligned to what Blackwell supports (NVFP4, MXFP4, MXFP8, FP8, BF16), but eventually we may want to support 3-bit quantization, which may require a custom kernel. Today, though, it just works in vLLM with existing well-supported types.

jwarner · April 20, 2026, 1:28am

I’ll grab it and experiment alongside the int4fp8 hybrid regarding throughput. Thanks for your work!

DropTheBeat · April 20, 2026, 2:24am

This is awesome! Thanks for your work on this!!

whpthomas · April 20, 2026, 2:55am

How large a model can I convert on my Spark? Can I increase swap to work on something larger, or does it all need to fit into physical memory?

tenari · April 20, 2026, 2:59am

Swap is bad lol. Don’t use swap. I had no trouble doing Qwen3.6 in bf16 – so ~70 gigs? YMMV, haven’t tried anything larger like Qwen3.5-122B

whpthomas · April 20, 2026, 3:01am

I ended up using swap with Intel AutoRound to overcome OOM during loading. It worked: it thrashed a bit at first, but once it got going it settled down.

That is my goal ;)

tenari · April 20, 2026, 3:03am

I do one linear/shard at a time, so it should be possible. I am literally trying it now. Race ya!

tenari · April 20, 2026, 3:12am

There are some issues. I’m going to fix them and then upload to HF.

joshua.dale.warner · April 20, 2026, 4:19am

If you are working with AutoRound, set the low_gpu_mem_usage flag to true. I believe that should allow almost anything to be quantized on the Spark, at the cost of overall efficiency of course.

tenari · April 20, 2026, 4:22am

I built something to stream weights for analysis on the Spark, so we should have support for 122B soon. I’ll also be uploading something to HF in the next day or so.

serapis · April 20, 2026, 4:52am

I wonder if a dual-node cluster would be sufficient to build Qwen3.5 397B. That model would greatly benefit from your work. I’ll follow the development and look forward to the first few success cases.

Benchmarks of models quantized with PrismaQuant vs. their NVFP4/AutoRound/FP8 counterparts would be interesting, to understand the implications beyond accuracy and model weights.

whpthomas · April 20, 2026, 5:23am

Very keen to give your 122B a try, put it through its paces and provide feedback.

ekkis · April 20, 2026, 12:21pm

I have a 2x GX10 cluster and would love to try this on MiniMax M2.7. Think it’s feasible?

bernardlbmi3 · April 20, 2026, 1:44pm

WOW! This is awesome!

tenari · April 20, 2026, 3:19pm

I do. It’s next up on my list as soon as I finish Qwen3.5-122B (it’s nearly done; just testing now). I’ve designed everything to run streaming so that it doesn’t need more memory than the box has to create a quantized image.

lewald_jens · April 20, 2026, 3:28pm

Wow - Cool.

THUNDER_SPARK · April 20, 2026, 3:38pm

A spark-vllm-docker recipe based on the Hugging Face description.
( rdtand/Qwen3.6-35B-A3B-PrismQuant-4.75bit-vllm · Hugging Face )

This is my first time writing a recipe, so if anything is incorrect or there is room for improvement, please feel free to let me know :) :)

model: rdtand/Qwen3.6-35B-A3B-PrismQuant-4.75bit-vllm

container: vllm-node-tf5

mods:
  - mods/fix-qwen3-coder-next
  - mods/fix-qwen3.5-chat-template

defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.7
  max_model_len: 262144
  max_num_batched_tokens: 16384
  speculative_config: '\{\"method\":\"mtp\",\"num_speculative_tokens\":3\}'

env: 
  FLASHINFER_DISABLE_VERSION_CHECK: 1

command: |
  vllm serve rdtand/Qwen3.6-35B-A3B-PrismQuant-4.75bit-vllm \
    --host {host} \
    --port {port} \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --attention-backend flashinfer \
    --enable-prefix-caching \
    --kv-cache-dtype fp8 \
    --trust-remote-code \
    --quantization compressed-tensors \
    -tp {tensor_parallel} \
    --speculative-config={speculative_config}
