Introducing PrismaScout -- PrismaQuant v2!

Hi all,

First of all, let me say I’m absolutely blown away by the response to PrismaQuant. Thank you all. So many people on this board have donated their energy to running benchmarks or providing feedback.

To date, PrismaQuant models have been downloaded over 80k times from HuggingFace, and the direct feedback I’ve gotten on them has been almost universally positive. This is legitimately an underserved area and I’m proud to lend my time, energy, and tokens to it. I open Twitter and Reddit and see people talking about the models without even going to look for it. It’s something I’ve never experienced before.

Secondly, thanks to NVIDIA for creating this terrific platform. We bellyache about the Spark on this forum regularly, and certainly, it’s not perfect, but there’s no arguing that it’s facilitated a massive amount of personal and professional growth for me and has been worth every penny. I’d love a second one, but maybe I’ll wait for a Vera-generation hardware refresh.

PrismaQuant V2 is here, and it leverages a new algorithm I’m calling PrismaScout.

It’s already on GitHub and I’m making models with it. Mathematically, it makes the original PrismaQuant look primitive – but I’m not sure that makes it better – yet. I can’t wait for you to try it and see. It upgrades how we determine sensitivity by tweaking weights and tracing those impacts all the way through the model, then doing some very cool optimization to figure out where quantization destroys the least value. I’ve leaned on some of the best and latest literature in the field – much of it preprint – and added some new mechanisms of my own. This was both harrowing theoretical work and an engineering challenge: getting it fast enough to be workable on a single Spark.

I’ve created a version of Qwen3.6-27B that is about 11% smaller and performs about 3-5% better. It’s at 5.3 bits, on average. This model is the best balance, mathematically, of performance and accuracy where the tradeoff is equally weighted between them (the “kneedle point” on the Pareto curve). It’s shipping here: rdtand/Qwen3.6-27B-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm on Hugging Face.
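For readers unfamiliar with the term, here is a minimal sketch of how a knee point on a size/quality curve can be picked. It is a simplified Kneedle-style rule, not PrismaScout's actual selection code, and the `bits` and `quality` values are made up purely for illustration (the function name `knee_index` is also hypothetical).

```python
def knee_index(xs, ys):
    """Return the index of the knee of an increasing, concave curve.

    Simplified Kneedle-style rule: rescale both axes to [0, 1] and pick
    the point where the curve rises furthest above the diagonal.
    """
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    best_i, best_gap = 0, float("-inf")
    for i, (x, y) in enumerate(zip(xs, ys)):
        xn = (x - x_lo) / (x_hi - x_lo)   # normalized cost (bits)
        yn = (y - y_lo) / (y_hi - y_lo)   # normalized benefit (quality)
        gap = yn - xn
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i


# Made-up (average bits, quality score) pairs, for illustration only.
bits = [4.0, 4.5, 5.0, 5.5, 6.0, 8.0]
quality = [0.80, 0.90, 0.95, 0.97, 0.98, 1.00]

i = knee_index(bits, quality)
print(f"knee at {bits[i]} bits, quality {quality[i]}")
```

Normalizing both axes first is what makes the tradeoff "equally weighted": one unit of normalized size counts the same as one unit of normalized quality.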

A paper describing the new techniques we used, as well as our citations of existing ones, is forthcoming.

Excited to hear your feedback. Benchmarks (GSM8K, etc.) are still running. I also hope to ship a non-Blackwell-specific version soon (MXFP4/MXFP8/BF16); this artifact is NVFP4/BF16 only (the optimizer elected against any MXFP8 legs).

Thank you for your continued support and feedback. Please let me know if you have any requests!

Rob

Some preliminary benches:

| Eval | Metric | Prior (5.5 bpp) | PrismaSCOUT (5.31 bpp) | Delta |
|---|---|---|---|---|
| GSM8K | strict exact match | 96.74% | 96.66% | -0.08 pp |
| GSM8K | flexible exact match | 96.66% | 96.59% | -0.08 pp |
| IFEval | prompt strict | 84.66% | 85.40% | +0.74 pp |
| IFEval | prompt loose | 87.80% | 88.72% | +0.92 pp |
| IFEval | instruction strict | 89.81% | 89.93% | +0.12 pp |
| IFEval | instruction loose | 92.09% | 92.45% | +0.36 pp |

We’re essentially tied with the 5.5-bit version, even though we shaved 2GB+ from the model.

How much does this improve speed? I haven’t tried anything below fp8 yet.

Testing this right now with DFlash and will run Tool Evals :) Thanks for this!

I haven’t yet found a 27B version that worked out well for me and I keep defaulting back to 3.6-35b or 3.5-122b. Maybe this is the one :)

It’s working great for some general tasks, but I have it infinite looping right now on some trivial code. It knows it’s going in circles but it just can’t break out.

That’s Qwen for you LOL.

Do I have to run the pipeline on the DGX Spark, or is that a process I could run on another machine and then copy over the results once done? (I’m new to quantization.)

If you just want to run the models, you can do it on a Spark, 5090, or 6000 Pro with vLLM. If you want to start quantizing your own models, it’s advisable to run it on something with a decent amount of VRAM – so, Spark, 6000 Pro, or a B200 if you have one lying around. I have not attempted quantization on something like a 5090 (I don’t have one) but it would probably work without too much adaptation – so long as you have a decent amount of system RAM. I’ve spent a good deal of time building logic that streams from NVMe to VRAM, so something analogous might need to happen where it streams from system RAM into VRAM.

If you make any cool changes, you’re welcome to submit a PR. I’d advise using Codex or Claude Code.

This is this model + DFlash. Very decent performance, a bit below par in tool calling compared to 3.6-35B (getting consistent 92 or 93 scores).

Will be running my real-world test now that takes multiple hours :)

Can we keep the MTP heads unquantised? I noticed the acceptance rate is not as good.

tbh I think most people are just using dflash now?

Not for production. DFlash is basically for code and benchmarks, otherwise it frankly does poorly. Also, at high concurrency, it usually doesn’t win.

I am using your original Qwen3.6 35B-A3B PrismaQuant 4.75 bit for some professional domain text and MTP at 3 positions just obliterated DFlash.

All of the above is subject to change if/when vLLM integrates DDTree, but until then, standard MTP is the winner for a lot of real workloads.

Same here, DFlash is awful in acceptance rates and real wall-clock performance. It looks good in benchmarks, but in actual real-world use it’s terrible. I’ve stayed with MTP across all my models for now.

Rob, I think this one might be a keeper. Still testing, but it handles concurrent multi-modal loads very well. I am getting excellent throughput. I will let you know if I run into any issues, but so far this might allow 27b and 35b to run on a single Spark, which would be really interesting.

# Qwen3.6-27B — PrismaSCOUT (Blackwell, NVFP4 + BF16)
# PrismaQuant export of Qwen/Qwen3.6-27B for vLLM compressed-tensors serving
# on NVIDIA Blackwell.

recipe_version: "1"
name: Qwen3.6-27B-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm
description: vLLM serving Qwen3.6-27B-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm

# HuggingFace model to download (optional, for --download-model)
model: rdtand/Qwen3.6-27B-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm

solo_only: true

# Container image to use
container: vllm-node-tf5

build_args:
  - --tf5

mods:
  - mods/fix-qwen3.6-enhanced-chat-template

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  max_model_len: 196608
  gpu_memory_utilization: 0.75
  max_num_batched_tokens: 32768
  max-num-seqs: 16
  served_model_name: qwen/qwen3.6-27b
  speculative_config: '{"method": "mtp", "num_speculative_tokens": 3}'

# Environment variables
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1

# The vLLM serve command template
command: |
  vllm serve rdtand/Qwen3.6-27B-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm \
  --served-model-name {served_model_name} \
  --max-model-len {max_model_len} \
  --gpu-memory-utilization {gpu_memory_utilization} \
  --max-num-batched-tokens {max_num_batched_tokens} \
  --max-num-seqs {max-num-seqs} \
  --port {port} \
  --host {host} \
  --quantization compressed-tensors \
  --load-format instanttensor \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --speculative-config '{speculative_config}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --chat-template qwen3.6-enhanced.jinja \
  --reasoning-parser qwen3

#  --language-model-only
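For anyone adapting the recipe, here is a rough sketch of how the `defaults` block presumably feeds the `{placeholder}` slots in the `command` template. The launcher's real substitution logic is not shown in this thread, so the `render` helper, the `overrides` dict, and the shortened template below are assumptions for illustration only.

```python
def render(template: str, values: dict) -> str:
    """Replace each {key} placeholder with its value from `values`."""
    out = template
    for key, value in values.items():
        out = out.replace("{" + key + "}", str(value))
    return out


# A subset of the recipe's defaults (keys copied as written in the recipe).
defaults = {
    "port": 8000,
    "host": "0.0.0.0",
    "max_model_len": 196608,
    "gpu_memory_utilization": 0.75,
    "max-num-seqs": 16,
}

# Shortened stand-in for the recipe's command template; MODEL is a placeholder.
template = (
    "vllm serve MODEL "
    "--max-model-len {max_model_len} "
    "--gpu-memory-utilization {gpu_memory_utilization} "
    "--max-num-seqs {max-num-seqs} "
    "--port {port} --host {host}"
)

overrides = {"port": 8001}  # hypothetical CLI override
command = render(template, {**defaults, **overrides})
print(command)
```

Plain string replacement is used here instead of `str.format` so that values containing braces, such as the JSON in `speculative_config`, pass through untouched.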

What sort of performance are you getting on it?

Often peaking at around 360 t/s on 16 concurrent threads. The workload is transcribing PDF pages and extracting needle-in-a-haystack type data. There is plenty of headroom here; the KV cache is hardly being used.

Obviously you need a parallelisable workload to take advantage of this.

Thank you very much for the recipe. Where can I find mods/fix-qwen3.6-enhanced-chat-template?

You can find it here: "qwen3.6-enhanced.jinja: CoT leakage into tool turns and why preserve_thinking works now" (Cheuk-Yiu Chan) - https://raw.githubusercontent.com/allanchan339/vLLM-Qwen3-3.5-3.6-chat-template-fix/main/chat-template/qwen3.6-enhanced.jinja

I am also running the model at the moment, and the results, especially the tool calling with the template, seem to be working splendidly. MTP acceptance rate is usually pretty high too: 70-80% on my code analysis workflows.
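To put an acceptance rate like that in perspective, here is a back-of-the-envelope sketch of the expected tokens per decoding step with 3 speculative tokens (the value in the recipe). It assumes each draft token is accepted independently with the same probability, which is a simplification, and vLLM's reported acceptance metric may be defined differently, so treat the output as a rough guide only. The function name is hypothetical.

```python
def expected_tokens_per_step(accept_rate: float, num_draft: int) -> float:
    """Expected tokens emitted per target-model step.

    Assumes each draft token is accepted independently with probability
    `accept_rate`; drafting stops at the first rejection, and every step
    still yields one token from the target model itself.
    Closed form of 1 + a + a^2 + ... + a^num_draft.
    """
    if accept_rate >= 1.0:
        return float(num_draft + 1)
    return (1.0 - accept_rate ** (num_draft + 1)) / (1.0 - accept_rate)


for rate in (0.70, 0.80):
    tokens = expected_tokens_per_step(rate, 3)
    print(f"acceptance {rate:.0%}: about {tokens:.2f} tokens per step")
```

Under these assumptions, 70-80% acceptance works out to roughly 2.5 to 3 tokens per step, against a ceiling of 4 with three draft tokens.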

thank you so much

I ran the same PDF conversion / extraction with each quant I have:

| Model | Size | Quant | Time | Errors |
|---|---|---|---|---|
| Qwen 3.6 | 27b | FP8 | 30:10 | 0 |
| Qwen 3.5 | 122b | INT4 AutoRound | 24:20 | 0 |
| Qwen 3.6 | 27b | NVFP4 BF16 | 15:30 | 0 |
| Qwen 3.6 | 35b | FP8 | 11:10 | 2 |
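A quick sketch for turning those timings into speedup ratios relative to the 27b FP8 run. It assumes the Time column is minutes:seconds; the unit is not stated, but the ratios come out the same if it is hours:minutes. The helper name `to_seconds` is just for illustration.

```python
def to_seconds(mmss: str) -> int:
    """Convert an 'MM:SS' string to seconds (assumed format)."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Times copied from the table above.
runs = {
    "Qwen 3.6 27b FP8": "30:10",
    "Qwen 3.5 122b INT4 AutoRound": "24:20",
    "Qwen 3.6 27b NVFP4 BF16": "15:30",
    "Qwen 3.6 35b FP8": "11:10",
}

baseline = to_seconds(runs["Qwen 3.6 27b FP8"])
for name, elapsed in runs.items():
    ratio = baseline / to_seconds(elapsed)
    print(f"{name}: {ratio:.2f}x the speed of 27b FP8")
```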

Subjectively, the PrismaSCOUT model is providing the livelier, 35b-like performance with the 122b quality I need. It still needs more testing, but this feels like a very positive development. Thanks to Rob, NVFP4 is starting to live up to its promise on the GB10.

Essentially, 35b is too unreliable for me and 27b was too slow; now it’s competitive with 122b.