Why Turboquant saves DGX twice

hey thank you for the work,

Can you please tell me how many context tokens you inject during the benchmark?

:)

2 Likes

That’s very cool. I wonder what 24h of this would do for us and how we could best integrate the result in the vLLM ecosystem

8k, 32k, 64k, 128k. The token rate decreases with context length. There is a result.md in the repo.

2 Likes

Honestly, we’re already happy if we find any CUDA kernel path for SM12x that actually runs. I also think that this “fallback” to Triton or Python is far too invisible to the average user. Attention is forced through FlashInfer by vLLM — whatever FlashInfer can’t do simply doesn’t work. Take INT4 AutoRound: you need the model (or MultiQuant with iter=0 loading) and can only run it sensibly on Marlin.

vLLM already has some kind of “autotuner” built in — which is frankly terrifying. Way too much “auto”. The autotuner assumes a meaningful candidate space of kernels to tune over. On SM12x, that space is often reduced to a single Triton fallback — or no native path at all. The autotuner then confidently selects the “best” option from a set of one, giving users false assurance that the system is properly optimized.

Triton generates PTX, which is then compiled to the concrete SM architecture via the LLVM backend path. For SM12x this means: Triton can generate kernels that run — but Triton has no native knowledge of SM12x-specific MMA tile sizes, shared memory layout, or warp scheduling. The generated code is functionally correct but not optimal — possibly significantly worse than handwritten CUDA. For simple ops (elementwise, reduction, dequant kernels) this is acceptable — Triton can utilize SM12x reasonably well there. For matmul and attention it hits the same wall as FlashInfer: without explicit MMA instructions targeting SM12x, you stay suboptimal regardless of what the autotuner reports.

Weight Quantization: Back to Basics with AutoRound

Up front: Weight quantization using the TQ/RQ solutions does not produce usable results. Therefore, back to the roots — Intel AutoRound. The iter=0 approach is built in as on-the-fly quantization. First measurements on GLM models show:

Calibrated vs RTN (on-the-fly):

         RTN (iters=0)   Calibrated (iters=200)   Δ
INT2     0.53            0.56                     +0.03
INT3     0.80            0.82                     +0.02
INT4     0.94 (Reference)

Calibration improves only marginally (+0.02–0.03). This is disappointing — at INT2, cos=0.56 is probably still insufficient for usable inference.
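For anyone wanting to reproduce the RTN (iters=0) side of that table: round-to-nearest is just per-group asymmetric quantization with no optimization loop at all. A minimal numpy sketch, assuming group size 128 and a plain flattened cosine as the weight-cos metric (both are my assumptions, not necessarily what AutoRound uses internally):

```python
import numpy as np

def rtn_quantize(w: np.ndarray, bits: int, group: int = 128) -> np.ndarray:
    """Round-to-nearest (iters=0): per-group asymmetric quant, then dequant back."""
    qmax = 2 ** bits - 1
    g = w.reshape(-1, group)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / qmax, 1.0)
    q = np.clip(np.round((g - lo) / scale), 0, qmax)   # integer codes
    return (q * scale + lo).reshape(w.shape)           # dequantized copy

def weight_cos(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity over the flattened weight tensors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
for bits in (2, 3, 4):
    print(f"INT{bits} weight cos: {weight_cos(w, rtn_quantize(w, bits)):.3f}")
```

On Gaussian random weights the cosines come out higher than the GLM numbers above — real model weight distributions (outliers, per-channel spread) are what make INT2 hard.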

1 Like

The good news: quant-at-load — quantize BF16 → INT3/INT4 on load — holds its own.

And yes, the numbers show: RTN (iters=0) vs calibrated (iters=1000) is surprisingly close:

                  RTN (on-the-fly)   Calibrated (best)   Δ
INT3 Weight cos   0.810              0.820               +0.010
INT3 Logit cos    0.9995             0.9999              +0.0004
INT2 Weight cos   0.562              0.566               +0.004

The 1000 calibration iterations (AutoRound best) yield only +0.01 weight cos for INT3 and virtually nothing for logit cos. This means RTN quant-at-load for INT3/INT4 is a valid approach — especially on DGX Spark, where RAM is tight:

  • No AutoRound run needed
  • No second model on disk
  • Load BF16 → quantize on the fly → INT3 in GPU RAM → fused GEMM kernel dequantizes only inside the GEMM

Saves disk, saves calibration time, and quality is nearly identical.
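As a shape-level illustration of that last bullet: after quant-at-load, only the 3-bit codes plus per-group scale/zero live in GPU RAM, and the full-precision weight matrix only exists transiently inside the GEMM. A numpy sketch (function names are hypothetical; a real fused kernel bit-packs the codes and dequantizes tile-by-tile in registers instead of materializing w):

```python
import numpy as np

BITS, GROUP = 3, 128
QMAX = 2 ** BITS - 1

def quantize_at_load(w: np.ndarray):
    """BF16/FP32 weights in; integer codes + per-group (scale, zero) out."""
    g = w.reshape(-1, GROUP)
    lo = g.min(1, keepdims=True)
    hi = g.max(1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / QMAX, 1.0)
    codes = np.clip(np.round((g - lo) / scale), 0, QMAX).astype(np.uint8)
    return codes, scale.astype(np.float32), lo.astype(np.float32)

def gemm_dequant(x, codes, scale, zero, out_features):
    """Dequantize 'inside' the GEMM: no persistent full-precision weight copy."""
    w = (codes * scale + zero).reshape(out_features, -1)
    return x @ w.T

rng = np.random.default_rng(1)
w = rng.standard_normal((512, 512)).astype(np.float32)
x = rng.standard_normal((4, 512)).astype(np.float32)

codes, s, z = quantize_at_load(w)   # this is all that stays resident
y = gemm_dequant(x, codes, s, z, 512)
y_ref = x @ w.T
```

The codes sit in uint8 here for clarity; the actual memory win comes from packing them at 3 bits per weight and never holding a BF16 copy of the matrix.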

4 Likes