Introducing PrismaScout -- PrismaQuant v2!

Continuing the discussion here rather than bumping the old PrismaQuant thread. This picks up from: Introducing PrismaQuant - #169 by tenari

@tenari I went down a rabbit hole regarding the suspected anomalous tensor weights in Qwen models described in the Reddit thread and that Aeon model.

I believe this repo - which was hard to find - includes the method to find and fix the anomalous weight scales. It is currently targeted at GGUFs but should be portable to safetensors. That has been on my shortlist.

How do you pick the optimal target bits for a model? Trial and error? The repo example is 4.75… but you appear to use a much higher value for the 27B variant of Qwen3.6. I am curious how to detect what is too little, what is optimal and what is too much.

No, PrismaQuant should pick the Kneedle point, i.e. the “knee” of the graph.
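For anyone wondering what “picking the knee” means in practice, here is a minimal sketch of one common knee heuristic: normalize both axes and take the point that sits farthest below the straight line joining the curve's endpoints. This is only an illustration, not PrismaQuant's actual code. The name `find_knee` and the sample numbers are made up, and it assumes the loss falls monotonically as bits increase.

```python
def find_knee(bits, loss):
    """Return the bits value at the knee of a decreasing loss-vs-bits curve.

    Hypothetical helper (not part of PrismaQuant). Assumes `bits` is sorted
    ascending and `loss` is highest at the first point, lowest at the last.
    """
    if len(bits) != len(loss) or len(bits) < 3:
        raise ValueError("need at least three (bits, loss) points")
    x0, x1 = bits[0], bits[-1]
    y_min, y_max = min(loss), max(loss)
    if x1 == x0 or y_max == y_min:
        raise ValueError("curve is flat or degenerate")

    best_gap, best_bits = float("-inf"), bits[0]
    for b, l in zip(bits, loss):
        x = (b - x0) / (x1 - x0)            # normalized bits in [0, 1]
        y = (l - y_min) / (y_max - y_min)   # normalized loss in [0, 1]
        # The chord between the endpoints is y = 1 - x after normalization;
        # the knee is where the curve dips farthest below that chord.
        gap = (1.0 - x) - y
        if gap > best_gap:
            best_gap, best_bits = gap, b
    return best_bits


# Made-up numbers purely for illustration.
candidate_bits = [4.0, 4.5, 4.75, 5.0, 5.5, 6.0, 8.0]
predicted_loss = [1.00, 0.40, 0.20, 0.16, 0.12, 0.10, 0.08]
print(find_knee(candidate_bits, predicted_loss))
```

Past the knee, each extra fraction of a bit buys very little loss reduction, which is why it makes a sensible default target.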

Rob, are there any special things we need to do to run PrismaSCOUT? I have a spare GB10 for about a week and I was planning to tackle some other models like 35b and 122b. Is it ready for that? If so, could you provide some broad brush strokes so we can follow along at home, like:

  1. Download the full fat FP16 model
  2. Run …

This is really outside my wheelhouse so any guide would increase my confidence. These can take a long time to run and I don’t want to waste compute on a misconfigured process chain.

I also have a “spare” GB10 for running dev jobs.
I am more than willing to pitch in with conversion jobs, if needed.

Thank you for an awesome project!

I just ran into this traceback while trying the new SCOUT version:

[pipeline] [4/4] exporting to compressed-tensors ...
[export-stream] model profile: qwen3_5
[export-stream] act-aware passes: awq=False gptq=True awq_round=False scale_sweep=True
[export-stream] unified input_global_scale across 110 fused-sibling groups (max pre-unify drift: 2.477e-01)
[export-stream] raw activations will be loaded lazily for AWQ/GPTQ/round passes (311/501 Linears indexed)
[export-stream] input_global_scale calibrated for 311/501 Linears from /home/bernard/usbdisk/dq-runs/Qwopus3.6-35B-A3B-v1-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm-4.75bits/act
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/bernard/usbdisk/github/prismaquant/prismaquant/export_native_compressed.py", line 5670, in <module>
    main()
  File "/home/bernard/usbdisk/github/prismaquant/prismaquant/export_native_compressed.py", line 4952, in main
    validate_mtp_assignment_coverage(args.model, assignment, profile)
  File "/home/bernard/usbdisk/github/prismaquant/prismaquant/export_native_compressed.py", line 5661, in validate_mtp_assignment_coverage
    raise RuntimeError(
RuntimeError: source checkpoint contains mtp.* weights but the allocator recipe contains no mtp.* entries. Re-run the incremental probe + cost with --include-mtp (the default) so mtp.* tensors are measured, then rerun allocator/export.

What should I do to recover from this?

I ran it with:

export MODEL_PATH=/home/bernard/.cache/huggingface/hub/models--Jackrong--Qwopus3.6-35B-A3B-v1/snapshots/bcf49105a92d28b3023985378b6366baf21225be
export WORK_DIR=~/usbdisk/dq-runs/Qwopus3.6-35B-A3B-v1-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm-4.75bits
export FORMATS=NVFP4,MXFP8_E4M3,BF16
export TARGET_BITS=4.75

./prismaquant/run-pipeline.sh

Hi Thomas, the monitoring looks nice. Where is the “spec-live” monitoring from? Did you build it?

It’s part of tool-eval-bench Release v1.5.0 - Speculative Decoding Dashboard · SeraphimSerapis/tool-eval-bench · GitHub

@tenari I worked with Claude to fix the bug. Here are the details if you want to integrate it into the code base:

Bug: MTP layers silently skipped when config omits num_nextn_predict_layers

Some Qwen3.5/3.6 finetunes (e.g. Jackrong/Qwopus3.6-35B-A3B-v1) retain the mtp.* weights from the base checkpoint but do not carry the num_nextn_predict_layers / num_mtp_layers field in their config.json. This causes build_extended_shard_regexes to derive n_mtp_config = 0, and the existing guard:


n_mtp = min(n_mtp_config, n_mtp_actual) if n_mtp_actual > 0 else 0

silently resolves to min(0, 1) = 0, dropping all MTP shards from the probe/cost schedule.

The probe still captures MTP Fisher stats (because incremental_probe has a separate detection path), but the cost pickle ends up with zero mtp.* entries. The allocator therefore produces a layer_config.json with no mtp.* assignments, and the export fails at the validate_mtp_assignment_coverage guard with:


RuntimeError: source checkpoint contains mtp.* weights but the allocator recipe contains no mtp.* entries. Re-run the incremental probe + cost with --include-mtp (the default) so mtp.* tensors are measured, then rerun allocator/export.

Fix

In incremental_probe.py, build_extended_shard_regexes: treat the three cases explicitly instead of using a single min():

| n_mtp_config | n_mtp_actual | Old behaviour | New behaviour |
|---|---|---|---|
| 0 | > 0 | min(0, N) = 0: MTP silently dropped | Trust weights; schedule n_mtp_actual shards + log notice |
| > 0 | 0 | Skip (correct) | Unchanged |
| > 0 | > 0 | min(config, actual) | Unchanged |

No changes to incremental_measure_quant_cost.py or the export; the fix is entirely in the shared shard-schedule function that both probe and cost consume.
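For reference, the case split in the table can be sketched roughly as below. This is not the actual patch: `resolve_n_mtp` is a hypothetical stand-alone helper written for illustration, and the real logic lives inside build_extended_shard_regexes, whose surrounding code is not shown here.

```python
import logging

log = logging.getLogger("mtp_schedule_sketch")


def resolve_n_mtp(n_mtp_config: int, n_mtp_actual: int) -> int:
    """Decide how many MTP shards to schedule (hypothetical helper).

    n_mtp_config: MTP layer count read from config.json (0 if the field is missing).
    n_mtp_actual: MTP layer count detected from mtp.* weights in the checkpoint.
    """
    if n_mtp_actual <= 0:
        # No mtp.* weights on disk: nothing to measure, whatever the config says.
        return 0
    if n_mtp_config <= 0:
        # Config omits the field but the weights exist: trust the weights.
        log.warning(
            "config declares no MTP layers but checkpoint has %d; scheduling them",
            n_mtp_actual,
        )
        return n_mtp_actual
    # Both sides report MTP layers: keep the original conservative behaviour.
    return min(n_mtp_config, n_mtp_actual)
```

With this split, the Qwopus case (config says 0, checkpoint has 1) schedules one MTP shard instead of silently dropping it, while the other two rows of the table behave as before.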

If anyone is interested in a PrismaQuant version of Qwopus3.6-35B-A3B-v1: cyburn/Qwopus3.6-35B-A3B-v1-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm-4.75bits · Hugging Face

What are the benefits of this one over the regular quants? I see a lot of “opus” versions but I have never had much success with them.

Good question. Supposedly it does better at HTML web sites. It was trained on Opus reasoning prompts and answers. But is it better than base Qwen3.6? I can’t say. I figured I would try the new PrismaScout version on it since no one had done it before. Based on the training output, 4.85 bits appears to be the sweet spot, so I might create that for testing.

Wish we could specify a “Pred. ΔLoss” target instead of bits.

I have made about a million updates in the last few days. I’ve seen this before but don’t think I’ve shipped any models w/ missing MTP weights. Thanks for the heads-up

Let me see if I can put together a guide. It’s been changing every day. I am almost happy with where PrismaScout is; I just keep finding things that annoy me and they take me down rabbit holes.

@whpthomas, @jl121 you guys are amazing. Once this is in a state where I want it, I’m going to start doing some bulk converts, and help with that will be GREATLY appreciated. I keep almost getting to the finish line and then finding something that really aggravates me or behaves non-deterministically. One thing I’ll say about Claude and Codex is that they can really amplify perfectionistic tendencies (lol).

The complexity (if you’re interested) all boils down to the interlayer interaction piece. Before we were measuring sensitivity just kind of naively, and applying some nice numerical polish (GPTQ, outlier sweeps, etc) and shipping.

To do outlier interaction, you basically have to tweak one linear (a subsection of one layer) with a quantization and then see how that propagates down the stack. But, that linear may already have numerical polish applied which can change whether it actually benefits from more bits for storage.

The search space, as you can imagine, is absolutely massive. Usually a dynamic programming algorithm that splits things into subproblems or makes progress towards a more optimal solution would be the key here, but the way we measure improvement (KL) is not necessarily convex, so imagine trying to solve a nested subproblem that gets worse before it eventually gets better. It’s… kind of awful and computationally very, very expensive. So we’re having to revert to some more tried-and-true methods that yield good, but not provably perfect, results. I’ve had to come to terms with being unable to achieve “optimal” results because the search space is simply too large.

There are some academic papers on this – and fortunately they’re posted free on ArXiv – but most of them take a much more naive approach to solving, OR, have a much smaller search space because they’re optimizing tiny vision models.

I do kinda have a background in math, but only academically; I’ve never used it professionally, so I’ve just been trying to learn as much as I can so that I can direct the agents productively towards implementation. I am not always successful (lol)

Fun fact: your models beat FP8 on speed and quality. I always assumed the original quants were best in quality, but apparently I was wrong. Thanks for your wonderful work. The tests are described here: https://x.com/vr8vr8/status/2052752870395511266

One day I had it crashing a lot, so I opened a discussion on HF offering to send them data on what had me running in circles. Guess what, my thread was quickly removed… So I stay away from DFLASH.

There’s nothing wrong with the FP8… it’s just naive.

It’s better to make 75% of the layers nvfp4 and 25% bf16 than it is to make 100% of them FP8. Some layers are just that sensitive. The value of nvfp4 is going to be finding the right place to use it.
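To put rough numbers on that trade-off, here is a small back-of-the-envelope sketch of average storage bits per weight. It makes assumptions: the fractions are treated as fractions of weights (i.e. all layers the same size), NVFP4 is counted as 4-bit values plus one 8-bit scale per 16-weight block, MXFP8 as 8-bit values plus one 8-bit scale per 32-weight block, and per-tensor scale overhead is ignored. `average_bits` is a made-up helper, not a PrismaQuant function.

```python
# Approximate storage cost per weight in bits, including block scales.
BITS_PER_WEIGHT = {
    "NVFP4": 4 + 8 / 16,       # 4.5: 4-bit values + 8-bit scale per 16 weights
    "MXFP8_E4M3": 8 + 8 / 32,  # 8.25: 8-bit values + 8-bit scale per 32 weights
    "FP8": 8.0,                # plain FP8, per-tensor scale overhead ignored
    "BF16": 16.0,
}


def average_bits(mix):
    """Average bits per weight for a {format: fraction_of_weights} mix."""
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"fractions must sum to 1, got {total}")
    return sum(BITS_PER_WEIGHT[fmt] * frac for fmt, frac in mix.items())


mixed = average_bits({"NVFP4": 0.75, "BF16": 0.25})
all_fp8 = average_bits({"FP8": 1.0})
print(f"75% NVFP4 + 25% BF16: {mixed:.3f} bits/weight")
print(f"100% FP8:             {all_fp8:.3f} bits/weight")
```

Under these assumptions the 75/25 mix comes out at 7.375 bits per weight, slightly below uniform FP8 at 8, while the sensitive quarter of the weights stays in full BF16.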

I really hope that we can make PrismaScout demonstrably better at lower bit rates. The version we’ve shipped at 5.3 bits is pretty good, but not universally better than 5.5. The testing I’ve done has shown me there’s still a good bit of headroom to squeeze out by properly quantifying the 2nd order interlayer interactions.

Stay tuned for more, and thanks for the positive feedback. It really keeps me going.

Rob

That’s true and decent tests help to identify those issues.

This framework continues to impress. I’m curious, would it be possible to calculate the held-out KL statistic to allow comparison between arbitrary quants in safetensors format? In practice this PrismaQuant framework seems to be the best approach to quantization in general that I’ve seen, and thus I’d trust it over “vibes” or small benchmarks as a framework to evaluate quants.
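To make the question concrete, this is roughly the statistic I have in mind: the mean per-token KL divergence between the reference model's next-token distribution and the quantized model's, measured on the same held-out text. It is a minimal sketch, not PrismaQuant's evaluator. `mean_token_kl` is a made-up name, and it assumes you have already dumped per-position logits from both models (loading the safetensors checkpoints and running them is left out).

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def mean_token_kl(ref_logits, quant_logits):
    """Mean KL(reference || quantized) over token positions, in nats.

    Both arguments are lists of per-position logit lists taken on the
    same held-out tokens (hypothetical input layout).
    """
    if len(ref_logits) != len(quant_logits) or not ref_logits:
        raise ValueError("need the same non-zero number of positions")
    total = 0.0
    for ref, quant in zip(ref_logits, quant_logits):
        log_p = log_softmax(ref)
        log_q = log_softmax(quant)
        # KL(p || q) = sum_i p_i * (log p_i - log q_i)
        total += sum(math.exp(a) * (a - b) for a, b in zip(log_p, log_q))
    return total / len(ref_logits)


# Tiny toy example with a 2-token vocabulary and two positions.
reference = [[0.0, 0.0], [2.0, -1.0]]
quantized = [[math.log(3.0), 0.0], [2.0, -1.0]]
print(mean_token_kl(reference, quantized))
```

Since the statistic only needs logits, it would in principle work for any pair of checkpoints that share a tokenizer, regardless of which tool produced the quant.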

I’d like to verify my tests and vibes on some model families with better data, in particular the Gemma4 family.