What's the best speed we can get with Qwen 3.6 27B without quantizing?

starkrun · April 23, 2026, 1:40am

So it’s looking like this might be the best local model we’re likely to get in 2026 (EDIT: runnable by people who aren’t oil sheiks), except maybe Qwen 3.6 122B.

I’m curious the max inference speed I can get out of it on a Spark. We all know about quantization, so that’s not what I’m asking here. Sticking to a high-quality baseline of FP8 model weights and unquantized KV cache, other than MTP can anything be done to improve throughput? Some experimental CUDA flag? Some VLLM flag?

Below are the numbers I got in my simple benchmark.

Environment

Model: Qwen/Qwen3.6-27B-FP8
VLLM: latest main build from today (via spark-vllm-docker)
Base command (without MTP):

vllm serve /modelmountpoint \
    --served-model-name "Qwen3.6-27B" \
    --max-model-len 32000 \   #I needed VRAM for something else, don't use this obviously
    --max-num-seqs 4 \
    --enable-prefix-caching \
    --gpu-memory-utilization 0.4 \  #same
    --port 8000 \
    --host 0.0.0.0 \
    --load-format fastsafetensors \
    --enable-chunked-prefill \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --max-num-batched-tokens 8192 \
    --trust-remote-code \
    --mm-encoder-tp-mode data \
    -O3

When adding MTP, I added --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'
I benchmarked with a basic openai pip package wrapper script I vibecoded
The two prompts I tested, representing 2 different tasks, are:
- Write me a commented quicksort function in Python
- Write me a 2-page essay on phone addiction

Results:

Configuration	Quicksort (tok/s)	Essay (tok/s)	Average (tok/s)	Speedup vs Baseline
Baseline (no MTP)	7.8	7.8	7.8	1.00x
`num_speculative_tokens=2`	13.4	14.8	14.1	1.81x
`num_speculative_tokens=3`	14.5	15.8	15.2	1.94x ← best
`num_speculative_tokens=4`	13.7	14.9	14.3	1.83x

josephbreda · April 23, 2026, 2:03am

The math works like this:

At FP8 you are moving 27B parameters X 1 byte per pass = ~ 27GB. You have ~270GB/s bandwidth maximum. Divide 270GB by 27GB and you get about 10 tokens per second max throughput, which is pretty close to your baseline.

The MTP trick is actually giving you a very nice “free” boost. I’m not sure you are going to see much better over most workloads.

To see why everyone is all hot and bothered to get proper NVFP4 support — you would only have to move about 7GB per pass – and should get a theoretical 38 or so tokens per second with very minimal loss of quality.

tenari · April 23, 2026, 2:39am

Not all layers are created equal. In Prismaquant, we detect the sensitive layers and leave them in BF16, and take the insensitive layers and make them nvfp4. Quantization isn’t bad if you know how to use it.

I dare say a model that’s 25% bf16 and 75% nvfp4 is better than one that is fb8 through-and-through. I’ve done a lot of research to show this. Take a look at the model card I created for the latest version of Qwen3.6 Prismaquant: rdtand/Qwen3.6-27B-PrismaQuant-5.5bit-vllm · Hugging Face

I have a section that compares various quantizations:

Target bpp	Achieved bpp	Predicted Δloss	NVFP4 / MXFP8 / BF16	vs 5.5 bpp
4.5	4.500	948	416 / 1 / 0	+99% Δloss, −18% size
4.75	4.750	704	373 / 12 / 32	+48% Δloss, −14% size
5.0	5.000	604	347 / 14 / 56	+27% Δloss, −9% size
5.25	5.250	532	321 / 20 / 76	+12% Δloss, −5% size
5.5	5.500	477	300 / 30 / 87	← this artifact
6.0	6.000	393	270 / 35 / 112	−18% Δloss, +9% size
7.0	7.000	276	211 / 62 / 144	−42% Δloss, +27% size
8.25	8.249	180	152 / 73 / 192	−62% Δloss, +50% size

You can see that on a 16-bit model, targeting an 8.25 bit quantization can get you great results – nearly that of BF16 at half the cost.

starkrun · April 23, 2026, 4:36am

Sorry but I just don’t trust that NVFP4 is something I should be using.

Making NVFP4 quants is non-trivial. Who is making them today?

The majority are made by amateurs on HF (never posting any benchmarks, often not even testing if their model even works). You can forget about this. NVFP4 requires an extra training round and that means spending on GPUs for a few days of training, having a great dataset to begin with, and knowing what you’re doing.
Nvidia themselves, with benchmarks showing NVFP4 is the same or better as FP8. And yet their NVFP4 has a worse KLD than AWQ 4-bit? ( Reddit - Please wait for verification ) . Is it that the extra training turning it into some sort of mini-finetune and KLD is no longer a good metric for loss of quality? idk.
No first-party AI lab is bothering to make NVFP4 quants. Not even American models like Gemma 4. It’s just a few days of extra work and yet the AI experts don’t care about it. I think that says something.

gc999999 · April 23, 2026, 5:57am

qwen3.6-27B surprises me, it finds immedaitely the bugs 35B struggles in loop, and get the bugs fixed within few rounds of debug process by itself. now i run qwen3.6-27B(Q4_K_M) in win11/RTX3090 with a acceptable speed, but i know it might become a nightmare to have 27B run in Spark. is there any way to have +30 tok/s in Spark?

DannyTup · April 23, 2026, 7:13am

I don’t know how well it translates to real work, but I was very surprised when running AgentBench that the dense model was barely slower than the MoE for a better score, see timings here:

I had assumed the MoE would be the obvious choice, but if this isn’t a fluke and it happens in real work, the dense one is the way to go. Those numbers are all with the same settings (MTP2 but nothing else).

(I do plan to add dflash and some of the other quant formats when I have time)

yaro.tal · April 23, 2026, 8:37am

Short answer: no. The math doesn’t math, we don’t have enough memory bandwidth for that.

starkrun · April 23, 2026, 2:26pm

I totally believe this. I know people were praising the 3.6 MoE, but I preferred 3.5 dense even before.

The MoE can follow instructions that already correctly identify what should be done, but it can’t come up with a good plan on its own for a non-trivial task. It’s not smart enough. It tries to solve things at the wrong layer (for example, I had a JSON escaping issue from bash → docker → ‘bash -c’ causing VLLM’s json.loads to give an error, it spent the first 10 minutes reading the source code of VLLM inside the 3rd party docker container and trying to patch based on that).

This is infuriating to use interactively. I’d rather wait on a slower but smarter dense model, that requires fewer guidance interactions from me. And as seen in your benchmark, actually finishes the task faster because it makes fewer mistakes.

I’m surprised the benchmark scores between MoE and dense are that close in your test tbh. It’s just not what I observed in reality.

DannyTup · April 23, 2026, 3:18pm

One benchmark probably isn’t a great indicator tbh, I really need to add some more. I will review the logs at some point too, but I don’t want to review loads of logs for lodas of models, so I’m hoping that I can use the numbers to narrow things down to a smaller selection that are worth reviewing (at which point I’d like to play around with the different flags/spec-decoding options, and maybe try to run the benchamrks through different harnesses to see how different tools affect things).

jwarner · April 23, 2026, 3:51pm

So you don’t consider Nvidia a first party AI lab?

“They aren’t making them” is a logical fallacy when Nvidia will do it for the big players. Major releases get NVFP4 quants done for them, released openly, and embedded into NIMs. Why would they?

NVFP4 is a 4 bit quant and like all such requires calibration data. Why? Because there is some loss. Calibrate your own if concerned. Or use FP8. But it isn’t “bad” - it’s a tool.

starkrun · April 23, 2026, 5:03pm

Tailored harnesses definitely pump up the score. Someone wrote a dedicated harness aimed at small local models (ie prompt, smart auto-retry strategy, detecting thinking loops) had the score of 3.5 9B go up from 19% to 45% in Aider Polyglot without/with the harness.

With Qwen3.6-35B-A3B UD-Q4_K_M on TerminalBench, he got 40%, which is the score Sonnet 4.5 got when used with Claude Code, which is the most likely use-case for developers doing agentic tasks.

Turrican · April 23, 2026, 5:03pm

Has anyone created an recipe (eugr/spark-vllm-docker) for the model?

DannyTup · April 23, 2026, 5:16pm

Nice, this is the sort of thing I keep thinking about. Like there are many different possible implementations of an edit tool for modifying files. It would be fun to run evals across lots of variations and see how they affect things, but first I need to figure out a good model to start with. While the results might differ between models, trying every combination would take forever, so picking a good base model to start with and then iterating on tools/harness would probably be more reasonable.

aostang · April 23, 2026, 5:51pm

Saying only amateurs are creating NVFP4 is far from accurate.

sjug · April 23, 2026, 5:59pm

Those are for MacOS/MLX

aostang · April 23, 2026, 6:06pm

My point against your statement is that Unsloth does create NVFP4 quants and he almost unarguably does more quantization than anyone even including the actual model providers. What the Spark community is facing is that very few people target our hardware, NVFP4 or otherwise. This is sadly especially true for the largest models. It’s why I’ve been hoping we can get improved GGUF support since that’s what almost everyone is releasing in. At least we could use those until a more optimal quant format is generated by someone for the given model that we can actually use.

Teason2026 · April 23, 2026, 6:56pm

This is one of the enterprise vendor for Inference as a Service. NVFP4 very popular for Blackwell generation as it gives lot more performance from limited corp hardware (clouds different story). But spark is sm12x not sm100 (actual Blackwell).

Overall, best NVFP4 done same way as REAP, tuned to message history for that particular use case.

aceangel · April 23, 2026, 7:27pm

RedHatAI also seems reputable, they also produce NVFP4 quants. On the amateur side- it’s relatively easy to do (easy to quantize to NVFP4), provided you have all the data you need. There are a lot that work just fine.

aostang · April 23, 2026, 7:43pm

The issue is the amount of RAM you need in order to quantize the largest models. Trying to quantize something like GLM or Kimi can take 2TB of RAM. I tried doing it with 192GB and disk swap and even Claude and Gemini gave up after trying everything they could think of, saying it wasn’t going to happen and to just rent a RunPod with 2TB so that everything stayed in RAM.

tenari · April 24, 2026, 2:06pm

Prismquant handles quantization incrementally. Try it! I spent $100 a few weeks ago renting an H100 trying to quantize qwen3.5 122B fully to nvfp4 before doing this prismquant work. If you just want pure quantization you can ask Claude or codex to disable all the other bits; but those parts are great as well.

I can do massive models locally with 80-90 gb of ram. The more ram you have the better. This is on a spark, btw. I’ve quantized the full minimax 2.7 with memory to spare.

Also: re nvfp4. First party labs probably aren’t doing it because it’s so new. Nvidia is doing it for nemotron, but the Chinese labs aren’t doing it because they literally can’t get blackwell, and even if could, the addressable market is too small. Deepseek released V4 yesterday with an mxfp4 expert layer since that format is a bit more standard and Blackwell compatible, but also regarded as inferior to nvfp4.

Is nvfp4 perfect and all there yet? No, but it’s also < 1 year old. This community has been integral in getting support for it into vllm, cutlass, flashinfer etc and will continue to be so.

Rob

Topic		Replies	Views
Qwen3.6-27B is out! DGX Spark / GB10 agentic-ai	299	27605	June 26, 2026
Qwen/Qwen3.6-35B-A3B (and FP8) has landed DGX Spark / GB10 agentic-ai	309	27795	June 22, 2026
Qwen3.5-122B-A10B on single Spark: up to 51 tok/s (v2.1 — patches + quick-start + benchmark) DGX Spark / GB10 cuda , performance , docker , performance-tuning , llm	434	22278	June 24, 2026
Qwen3.5-122B-A10B NVFP4 Quantized for DGX Spark — 234GB → 75GB, Runs on 128GB DGX Spark / GB10 Projects	44	11456	April 9, 2026
Introducing PrismaQuant DGX Spark / GB10	166	6617	June 12, 2026
Benchmark Report: Qwen3.6-35B-A3B-NVFP4 on NVIDIA DGX Spark, Jetson Thor, Blackwell 6000 Pro DGX Spark / GB10 Projects	10	2912	June 2, 2026
Qwen/Qwen3.5-122B-A10B - Alibaba/Qwen thought about us... :-D DGX Spark / GB10	340	17015	March 24, 2026
Does Qwen3.5-35B-A3B on GB10 leave a lot of performance on the table? DGX Spark / GB10 agentic-ai	40	6094	March 16, 2026
RedHatAI/Qwen3.5-122B-A10B-NVFP4 seems to be the best option for a single Spark DGX Spark / GB10 Projects llm	75	6406	May 4, 2026
Qwen3.5 27B optimisation thread starting at 30+ t/s TP=1 DGX Spark / GB10 llama , agentic-ai	23	2849	May 11, 2026

What's the best speed we can get with Qwen 3.6 27B without quantizing?

Environment

Related topics