Hey, thank you for the work!
Can you please tell me how many context tokens you inject during the bench test?
:)
That’s very cool. I wonder what 24h of this would do for us and how we could best integrate the results into the vLLM ecosystem.
8k, 32k, 64k, 128k. Token rate decreases as context grows. There is a result.md in the repos.
Honestly, we’re already happy if we find any CUDA kernel path for SM12x that actually runs. I also think that this “fallback” to Triton or Python is far too invisible to the average user. Attention is forced through FlashInfer by vLLM — whatever FlashInfer can’t do simply doesn’t work. Take INT4 AutoRound: you need the model (or MultiQuant with iter=0 loading) and can only run it sensibly on Marlin.
vLLM already has some kind of “autotuner” built in — which is frankly terrifying. Way too much “auto”. The autotuner assumes a meaningful candidate space of kernels to tune over. On SM12x, that space is often reduced to a single Triton fallback — or no native path at all. The autotuner then confidently selects the “best” option from a set of one, giving users false assurance that the system is properly optimized.
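The "set of one" failure mode is easy to picture with a toy sketch. Everything here is illustrative (the names `autotune`, `triton_fallback`, and the candidate dict are made up, not vLLM's actual API): the tuner happily benchmarks and "selects" a winner even when there was never anything to choose between.

```python
# Toy autotuner: times each candidate kernel and picks the fastest.
# On a well-supported arch the candidate dict would hold several tuned
# CUDA kernels; on SM12x it often degenerates to one generic fallback.
import time

def triton_fallback(x):
    # Stand-in for the generic Triton-generated kernel.
    return [v * 2 for v in x]

candidates = {"triton_fallback": triton_fallback}

def autotune(candidates, sample):
    timings = {}
    for name, kernel in candidates.items():
        start = time.perf_counter()
        kernel(sample)
        timings[name] = time.perf_counter() - start
    best = min(timings, key=timings.get)
    # Confidently reports a "best" kernel even from a set of one.
    print(f"autotuner: selected {best} out of {len(timings)} candidate(s)")
    return candidates[best]

kernel = autotune(candidates, list(range(1024)))
```

The report looks identical whether the tuner compared eight kernels or one, which is exactly why the fallback is invisible to the average user.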
Triton generates PTX, which is then compiled to the concrete SM architecture via the LLVM backend path. For SM12x this means: Triton can generate kernels that run — but Triton has no native knowledge of SM12x-specific MMA tile sizes, shared memory layout, or warp scheduling. The generated code is functionally correct but not optimal — possibly significantly worse than handwritten CUDA. For simple ops (elementwise, reduction, dequant kernels) this is acceptable — Triton can utilize SM12x reasonably well there. For matmul and attention it hits the same wall as FlashInfer: without explicit MMA instructions targeting SM12x, you stay suboptimal regardless of what the autotuner reports.
Up front: Weight quantization using the TQ/RQ solutions does not produce usable results. Therefore, back to the roots — Intel AutoRound. The iter=0 approach is built in as on-the-fly quantization. First measurements on GLM models show:
Calibrated vs RTN (on-the-fly):
| Precision | RTN (iters=0) | Calibrated (iters=200) | Δ |
|---|---|---|---|
| INT2 | 0.53 | 0.56 | +0.03 |
| INT3 | 0.80 | 0.82 | +0.02 |
| INT4 | — | 0.94 | (Reference) |
Calibration improves only marginally (+0.02–0.03). This is disappointing — at INT2, cos=0.56 is probably still insufficient for usable inference.
The good news: quant-at-load — quantize BF16 → INT3/INT4 on load — holds its own.
And yes, the numbers show that RTN (iters=0) is surprisingly close to calibrated (iters=1000):
| Metric | RTN (on-the-fly) | Calibrated (best) | Δ |
|---|---|---|---|
| INT3 Weight cos | 0.810 | 0.820 | +0.010 |
| INT3 Logit cos | 0.9995 | 0.9999 | +0.0004 |
| INT2 Weight cos | 0.562 | 0.566 | +0.004 |
The 1000 iterations of calibration (autoround best) yield only +0.01 weight cos for INT3 and virtually nothing for logit cos. This means RTN quant-at-load for INT3/INT4 is a valid approach, especially on DGX Spark where RAM is tight: it saves disk space, saves calibration time, and quality is nearly identical.
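For reference, the RTN (iters=0) path can be sketched as plain per-group round-to-nearest followed by a weight-cosine check against the original. The group size, the symmetric scheme, and all function names here are assumptions for illustration, not AutoRound's actual implementation:

```python
# RTN sketch: symmetric per-group round-to-nearest quantization of a
# weight vector, then cosine similarity of dequantized vs original.
import math
import random

def rtn_quantize(w, bits=3, group_size=8):
    qmax = 2 ** (bits - 1) - 1            # e.g. 3 for symmetric INT3
    deq = []
    for g in range(0, len(w), group_size):
        group = w[g:g + group_size]
        scale = max(abs(v) for v in group) / qmax or 1.0
        q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in group]
        deq.extend(qi * scale for qi in q)  # dequantize for comparison
    return deq

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

random.seed(0)
w = [random.gauss(0, 0.02) for _ in range(4096)]  # toy BF16-like weights
print(f"INT3 weight cos: {cosine(w, rtn_quantize(w, bits=3)):.3f}")
print(f"INT2 weight cos: {cosine(w, rtn_quantize(w, bits=2)):.3f}")
```

Even this toy version reproduces the qualitative pattern above: INT3 retains a much higher weight cosine than INT2, and no calibration loop is involved at all.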