@flash3 It’s interesting that you say that NVFP4 is inferior to INT4 in terms of accuracy. I was waiting for NVFP4 support with perhaps the incorrect expectation that it would improve accuracy. If it won’t then I’ll stop awaiting it. Would you happen to be able to point to some citation for my reference?
@Albond in his original comment said “if someone with serious compute were to re-quantize with nsamples=256 and more calibration iterations, the quality improvement would be significant.” I saw that Intel does have a project where they distribute the autoround utilities. How much compute are we talking? It might be interesting to do as you say for our own “optimized” form of these models.
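For reference, a re-quantization run along the lines Albond describes might look something like this with Intel's auto-round CLI. The flag names here are from memory and worth checking against `auto-round --help`; the model name is the FP8 base mentioned elsewhere in this thread, and the output directory is illustrative:

```shell
# Hedged sketch: flag names per my recollection of Intel's auto-round CLI.
# nsamples/iters raised beyond the published run for heavier calibration.
auto-round --model Qwen/Qwen3.5-397B-A17B-FP8 \
  --bits 4 --group_size 128 \
  --nsamples 256 --iters 1000 \
  --output_dir ./Qwen3.5-397B-A17B-AR-INT4
```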
Finally, I’ll mention that I tried a test where I ran the 122B with the DFlash model of 35B, hoping that, while not optimal, it would work. It did not. I have the impression that draft models are closely intertwined with the model they draft for.
Thank you very much @Albond @flash3, we certainly don’t want to lose precision. @eugr has a 397B configuration in the ring; there’s room for drafting there, but it was losing performance. Perhaps with these optimizations we’ll get more stability and speed without losing precision.
● Repository Summary
A new repository has been created at Qwen3.5-397B-A17B-AR-INT4 with the following features:
Repository Structure
Qwen3.5-397B-A17B-AR-INT4/
├── docker/
│ ├── Dockerfile.v2
│ └── entrypoint-v2.sh
├── patches/
│ ├── 01-hybrid-int4-fp8/
│ │ ├── inc.py
│ │ └── build-hybrid-checkpoint.py
│ ├── 02-mtp-speculative/
│ │ └── add-mtp-weights.py
│ └── 03-int8-lm-head/
│ └── patch_int8_lmhead.py
└── install.sh
Docker Image
- Name: vllm-qwen35-397b-v2:latest
- Size: 19.1 GB
- Base: vllm-sm121:0.19.1 (SM121/Blackwell)
Included Optimizations
1. Hybrid INT4+FP8: Dispatch for shared expert dense layers
2. INT8 LM Head v2: Triton kernel for output layer
3. MTP Speculative Decoding: Weights for speculative decoding
4. 512 Experts Support: Adapted for 397B architecture
Key Differences from 122B Model
- 512 experts (vs 256 in the 122B)
- 397B total parameters
- 17B active parameters
- build-hybrid-checkpoint.py configured to use Qwen/Qwen3.5-397B-A17B-FP8
Next Steps (to build and launch)
cd /Qwen3.5-397B-A17B-AR-INT4
./install.sh --launch
Credits
- thanks @albond
- Built with: Claude Code
- Model: Qwen3.5-397B-A17B-int4-AutoRound (Intel AutoRound quantization)
- Status: Repository and Docker image built successfully. Pending testing.
The repository is ready and the Docker image was built successfully without modifying the original 122B repository.
Yes please, that would be awesome.
On a related note, I’d like to look at doing that for the 397B model too, as the current one on Hugging Face was created with the --disable_opt_rtn switch, which means it’s not calibrated and could be even better at the same size. I’m not sure how much VRAM that would require, though.
Let me be precise about this: I mean NVFP4 in the form of a quantized model. If you train a model in NVFP4, it will perform great. But compressing a linear dynamic range onto a logarithmic scale presupposes assumptions about the underlying weight distributions. In some math benchmarks, NVFP4 lost against INT4 AutoRound; independently of that, and this is where it gets more empirical, models in INT4 simply seem to respond better. There was already a small consensus on this in the forum. But you have to have actually completed tasks with both variants to judge.
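To make the logarithmic-vs-linear point concrete, here are the values an NVFP4 element can actually take. This is a small sketch based on the published FP4 E2M1 element format (block and per-tensor scales omitted); the spacing widens as you move away from zero:

```python
# Positive magnitudes representable by an FP4 E2M1 element (per the format
# description). With signs, the 16 code points collapse to 15 distinct values
# because +0 and -0 coincide.
E2M1_POS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
grid = sorted({s * v for v in E2M1_POS for s in (1.0, -1.0)})

# Step sizes between adjacent magnitudes: fine near zero, coarse at the edges.
steps = [b - a for a, b in zip(E2M1_POS, E2M1_POS[1:])]
print(grid)
print(steps)  # [0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 2.0]
```

Compare that to INT4, where all 15 levels are spaced evenly across the range.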
You’re absolutely right about the analytical gap — and I appreciate you pointing it out directly, because it’s something I see constantly too.
To be clear about what we published: v2.1 was intentionally practical. The goal was to give people a working, reproducible pipeline — clone, build, run, get 51 tok/s. No exotic dependencies, no “works on my machine” asterisks. That’s the public-facing mission, and I think it landed well.
But privately, I enjoy the deeper analysis. Even when every path hits the same 273 GB/s wall and the same SM121 constraints, the process of working through why something doesn’t transfer — why a 3.4x on H100 might be a 1.1x or even a regression on Spark — that’s where the real understanding lives. You learn more from a well-analyzed dead end than from a lucky benchmark.
Right now I’m cataloging speculative decoding approaches (DART, DFlash block diffusion, Jacobi forcing, STree for tree verification on SSMs) and almost every promising idea crashes into the same wall: Qwen3.5’s hybrid architecture. 75% of layers are DeltaNet (linear attention with matrix-valued recurrent state), which breaks the assumptions behind tree-based verification, Jacobi-style parallel decoding, and most draft-and-verify schemes. The KV-cache world has 15 years of tricks; the SSM/linear-attention world has almost none.
It’s frustrating and fascinating in equal measure.
Your point about the Marlin kernel being well-optimized for SM12x matches what we see — 84% bandwidth utilization, there’s genuinely nothing left to squeeze there. Which is exactly why the interesting question shifts from “faster kernels” to “fewer reads per useful token.” Whether that’s achievable on this hardware with this model is an open question I haven’t answered yet, but I’d rather chase it honestly than publish another “5x speedup” that only works on a 0.8B model with 1.8 TB/s.
And honestly — even if none of these paths pan out for Spark specifically, the landscape right now is genuinely exciting. The amount of creative work happening in speculative decoding, diffusion-based drafting, and hybrid SSM+attention serving is remarkable. It just needs more people willing to do the math before the benchmarks.
I’m planning on trying AutoRound for NVFP4. I think the reason the Qwen models perform so well under INT4 quantization is that all of the internal layers are quantized and made smaller, whereas in the typical NVFP4 quantization we’ve seen so far, many layers are left as BF16. Calling a model NVFP4-quantized has been a big misnomer, given how much of the bandwidth is still occupied by the remaining parts of the model that are unquantized.
There are also a lot of models providing mixed BF16 (shared experts) and INT4 (routed experts); you can select this when running AutoRound. And to my surprise, simple RTN is only 0.2 cos worse than AutoRound at iters=200. It would be interesting to see how you map/rescale the dynamic range to NVFP4. And what do you think makes it better in the packed format (NVFP4)? There is no hardware support for this promotional vehicle.
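For anyone wanting to reproduce that kind of cos comparison, here is a minimal sketch of plain RTN INT4 with per-group scales (the iters=0 baseline), scored by cosine similarity against the original weights. The group size and the toy Gaussian weights are my assumptions, not the actual calibration setup:

```python
import numpy as np

def rtn_int4_groupwise(w, group_size=128):
    """Round-to-nearest symmetric INT4 with one scale per group (no tuning)."""
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # map max |w| to level 7
    q = np.clip(np.round(w / scale), -8, 7)
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096)   # toy weight vector, not real model weights
wq = rtn_int4_groupwise(w)
cos = w @ wq / (np.linalg.norm(w) * np.linalg.norm(wq))
print(f"cosine similarity: {cos:.4f}")
```

Even this untuned baseline lands very close to 1.0, which is consistent with the small RTN-vs-AutoRound gap described above.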
lol @ “promotional vehicle” – not entirely incorrect. NVFP4 can have real benefits, though, and it has been working well on data center cards for some time. Two cardinal sins have made it perform like crap, or not work at all, on SM120/121 cards: 1) 99 KB of smem vs 228 KB on the data center cards, and 2) the missing dequantization instructions (fixed in CUDA 12.9). There have been some important changes to vLLM, FlashInfer, and CUTLASS in the last few weeks that really improve NVFP4 support on consumer-grade + Spark hardware, so I think it’s worth a try now. Using NVFP4 in Marlin has not been performant because the FP4 values get decompressed at runtime into 16-bit data, which runs on the 16-bit hardware and consumes a commensurate amount of memory bandwidth, hence no benefit. Having 4-bit floats use 4-bit activations and dedicated 4-bit hardware, and travel through memory as 4 bits, can probably yield some improvement.
The value of NVFP4 in the data center is real, and the format is legitimately better, so I think it’s worth the old college try. The hardware support is there now, even if somewhat less than optimal. The Spark really occupies a strange space: it’s a machine with a massive amount of slow memory, so it also stands to gain the most from NVFP4 of any Blackwell-series machine (aside from Jetson?), if that makes sense. The memory bandwidth of even something like a 5070 is going to be 2x higher, so as long as the model being used fits in memory, it’s probably going to be decently performant. For inference, 5080s, 5090s, and 6000 Pros are much more common, and those cards have GDDR7 and insane memory bandwidth that just puts the Spark to shame.
So, overall, I do think it’s worth a try. It’s a much better format than int4, and so it should be able to get at least on par with it. I guess we’ll see, though!
Well @Albond, it’s not a benchmark, but I coded all day with the new qwen35-122b-hybrid-int4fp8 setup and the speedup is really noticeable. I’m getting to the point where I have a new prompt ready just as the current job is done, so I didn’t find myself waiting much at all.
What was hard for the Qwen3.5 model today was coding functions containing prompts – so a high risk of token and instruction entanglement, but there was very little evidence of this occurring. A few odd tool choices, but most models struggle with long contexts containing these kinds of coding tasks.
The main point is that I ran 4 concurrent sessions, limited each to about 130k of context per task before refreshing, and I saw one failed tool call that caused a halt. In my experience that’s excellent. It felt like programming regular Qwen3.5 INT4 AutoRound with FP16 KV, just twice as fast.
Brilliant! Well done everyone for helping Albond figure this one out. What an unexpected bonus!
Also note that my observations are probably loaded with confirmation bias. I have been improving my prompts following Dex Horthy’s CRSPI technique over the past weeks, learning to work with Qwen3.5 rather than fight it. I’m using an 8-step chain: questions → research → design → structure → multi-plan → contract → implement ↔ validate GAN loop, leading to early alignment, smaller prompts, and improved reliability and task completion, so there are significant parallel improvements complementing the model speed gains.
So you know someone who uses NVFP4? For business? Don’t name him here (protecting his reputation). Or do you plan to own data center hardware and speculate on… a performance gain compared to… BF16 or Marlin? Marlin is INT-only, no packed NVIDIA surprise values. I think NVIDIA placed this wonder weapon to own the 4-bit discussion. Look into the history of NVFP4: the spec came first, a lot of quant algos followed, and they still kept falling behind INT4 AutoRound (or simple RTN at iters=0). Impossible! No one pays for simple answers.
I am (re)writing an AI client in Python that is driven by a custom Hierarchical State Machine (HSM) engine and writes code for that engine. I’ve been using @Albond’s gift to code it over the last couple of days, and things are working quite well. Tool calls have been a slight problem (sending strings for edit_file offsets rather than integers, and messing up the Python list format when creating multiple directories). I changed the AI client to coerce the values correctly, and changed the create_directory tool to accept a single string parameter instead of a list (make one directory at a time); that has fixed the problems so far. Things slow down as the roll gets longer, but that is expected. Qwen is having to modify HSM maps in a database as well as write Python code, and it’s working well. This means I can actually keep the Spark and get use out of it rather than returning it before the 30-day window is up. I am VERY thankful for this: the box will pay for itself in a couple of months vs. spending money on tokens. Many thanks!
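The coercion fix described above can be sketched roughly like this. The function name and schema mapping are hypothetical, not the actual client code; the idea is just to normalize model-emitted argument types before dispatching the tool call:

```python
def coerce_tool_args(args: dict, schema: dict) -> dict:
    """Coerce model-emitted argument values to the types the tool expects.

    `schema` maps argument name -> expected Python type. Both the function
    and the schema shape are illustrative, not the real client's API.
    """
    coerced = {}
    for name, value in args.items():
        want = schema.get(name)
        if want is int and isinstance(value, str):
            value = int(value.strip())      # "42" or " 42" -> 42
        elif want is list and isinstance(value, str):
            value = [value]                 # lone string -> one-element list
        coerced[name] = value
    return coerced

args = coerce_tool_args({"offset": " 42", "paths": "src"},
                        {"offset": int, "paths": list})
print(args)  # {'offset': 42, 'paths': ['src']}
```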
@Albond I read on your GitHub that the production recipe enables prefix caching, but the “What didn’t work” section also lists prefix caching. This is confusing. Will it work?
It’s perfectly valid to hate nvfp4 because it breaks interoperability with gpus of other manufacturers. The integer pipelines are older, more mature, and probably more performant out of the box. That doesn’t mean exploring nvfp4 and spending time to optimize it won’t yield real results.
NVFP4 has been out a year and is only supported on one class of GPUs.
I use it for local inference, and it’s fine. I’m not going to die on this hill, because it doesn’t matter that much, but mathematically NVFP4 is superior to 4-bit int: integers are spaced evenly throughout the range, whereas floats can be made to look more like the actual distribution of the weights.
At best it’s immature, but it’s also better than MXFP4 because the block scaling uses FP8 scales (plus a full-precision per-tensor scale) rather than MXFP4’s power-of-two scales. With optimization, this format should be both better and faster than integer. Is it today? Probably not. Could it be? Probably, and measurably so.
I don’t think it’s hype that nvfp4 is almost as good as full-precision bf16 when the quantization is done right.
I’m running experiments now to understand which model layers are the most sensitive to quantization, and how to balance which parts of the model to leave in NVFP4 and which to keep at FP8 or 16-bit.
Yes: for the dense core of the weight distribution, roughly ~95% of all weights clustered near zero, a non-uniform (say logarithmic) quantization grid actually maps the distribution better than INT4’s uniform spacing. Near zero you get finer resolution than INT4, and that’s exactly where the mass of the weights sits.
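A quick numerical sanity check of that claim, using the E2M1 magnitude set as the FP4 grid and a clipped Gaussian as a stand-in for the dense core. Both grids are aligned to the same max range, and the data are toy values, not real weights:

```python
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.sort(np.concatenate([-E2M1, E2M1]))  # non-uniform, fine near 0
INT4_GRID = np.arange(-7, 8, dtype=float)          # uniform, symmetric

def quantize_to_grid(w, grid, max_code):
    """Nearest-neighbor quantization with one scale aligning max |w| to the grid edge."""
    scale = np.abs(w).max() / max_code
    codes = grid[np.argmin(np.abs(w[:, None] / scale - grid[None, :]), axis=1)]
    return codes * scale

rng = np.random.default_rng(1)
w = rng.normal(0.0, 1.0, 10000)
w = w[np.abs(w) < 3.0]   # keep only the dense core, no outliers

err_fp4 = np.mean(np.abs(w - quantize_to_grid(w, FP4_GRID, 6.0)))
err_int4 = np.mean(np.abs(w - quantize_to_grid(w, INT4_GRID, 7.0)))
print(f"mean abs error  fp4: {err_fp4:.4f}  int4: {err_int4:.4f}")
```

On this outlier-free core, the FP4 grid comes out ahead precisely because most of the mass sits where its steps are smallest.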
But research has shown that in large models, individual feature dimensions with extremely high activation values emerge systematically, called “emergent outliers.” These typically account for only ~0.1% of dimensions, yet contribute disproportionately to model quality (see Dettmers et al., 2022 — LLM.int8()).
INT4 with per-group scaling (as used in AutoRound) has a decisive advantage here: the scaling factor per group adapts the uniform grid to the local distribution. If a group contains an outlier, the scaling factor increases — you lose precision on the small values within that group, but the outlier is at least captured correctly.
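A toy illustration of that trade-off: one quantization group containing a single outlier, quantized with a symmetric per-group INT4 scale. The numbers are made up, chosen only to show the mechanism:

```python
import numpy as np

# One quantization group: tiny weights plus a single emergent outlier.
group = np.array([0.01, -0.02, 0.015, 0.005, -0.01, 0.02, -0.015, 0.50])

scale = np.abs(group).max() / 7.0   # per-group scale stretches to cover the outlier
q = np.clip(np.round(group / scale), -8, 7)
deq = q * scale

outlier_err = abs(deq[-1] - group[-1])        # outlier lands exactly on code 7
small_err = np.abs(deq[:-1] - group[:-1])     # small values fall on a coarse grid
print(f"outlier error: {outlier_err:.2e}  max small-value error: {small_err.max():.3f}")
```

Here the outlier is captured essentially exactly, while every small value in the group collapses to zero, which is exactly the "lose precision on the small values, but capture the outlier" behavior described above.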
And yes again: NVFP4 with block scaling follows a similar concept in principle, but the combination of a logarithmic grid and coarse block-level scaling ends up suboptimal in both regimes: you neither represent the dense core as well as you could, nor handle the outliers gracefully.
So NVFP4 sounds good. It isn’t, quite. To be fair: it’s good, but not as good as the N makes you think it is.
It’s the same with pruning. You can’t just delete all the experts that are called seldom; there are experts that contribute significantly when they are called, so you have to keep them.
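That point can be sketched as a toy ranking: scoring experts by call frequency alone would prune the rare-but-important one first, while a score of frequency × average gate weight keeps it. All numbers are hypothetical, not measured routing statistics:

```python
import numpy as np

# Toy routing statistics for 4 experts (hypothetical numbers).
calls = np.array([900, 850, 100, 700])        # how often each expert is routed to
avg_gate = np.array([0.10, 0.12, 0.95, 0.08]) # mean router weight when selected

prune_by_freq = int(np.argmin(calls))              # frequency alone: drop expert 2
prune_by_impact = int(np.argmin(calls * avg_gate)) # freq x contribution: drop expert 3
print(prune_by_freq, prune_by_impact)
```

Expert 2 is called rarely but with a large gate weight when it fires, so the impact-weighted score protects it.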
Hm … yes, I hit a crash on the Mamba layers: AssertionError: In Mamba cache align mode, block_size (4176) must be <= max_num_batched_tokens (2048)
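Taking the assertion at face value, the fix would be raising the batched-token budget past the reported block size. A guess at the flag, assuming vLLM's standard serve option; verify the exact name against your vLLM version:

```shell
# The assertion requires max_num_batched_tokens >= block_size (4176),
# so raise the budget above it; flag name assumed from vLLM's serve CLI.
vllm serve <model> --max-num-batched-tokens 8192
```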
My default setup does not use any coding tools, so I put in a recommendation on how to start/stop from whpthomas’s post (I see his experience as closest to most others’), but LLMs have many personal specifics … I don’t think there is one best way for all cases.
Let me update, thank you.