Hey, thank you for the work!
Can you please tell me how many context tokens you inject during the bench test?
:)
That’s very cool. I wonder what 24h of this would do for us and how we could best integrate the results into the vLLM ecosystem.
8k, 32k, 64k, 128k. Token rate decreases as context grows. There is a result.md in the repos.
Honestly, we’re already happy if we find any CUDA kernel path for SM12x that actually runs. I also think that this “fallback” to Triton or Python is far too invisible to the average user. Attention is forced through FlashInfer by vLLM — whatever FlashInfer can’t do simply doesn’t work. Take INT4 AutoRound: you need the model (or MultiQuant with iter=0 loading) and can only run it sensibly on Marlin.
vLLM already has some kind of “autotuner” built in — which is frankly terrifying. Way too much “auto”. The autotuner assumes a meaningful candidate space of kernels to tune over. On SM12x, that space is often reduced to a single Triton fallback — or no native path at all. The autotuner then confidently selects the “best” option from a set of one, giving users false assurance that the system is properly optimized.
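The "set of one" failure mode is easy to picture with a toy sketch. Everything here is illustrative (the names `autotune`, `triton_fallback`, and the candidate dict are made up, not vLLM's actual API): the tuner happily benchmarks and "selects" a winner even when there was never anything to choose between.

```python
# Toy autotuner: times each candidate kernel and picks the fastest.
# On a well-supported arch the candidate dict would hold several tuned
# CUDA kernels; on SM12x it often degenerates to one generic fallback.
import time

def triton_fallback(x):
    # Stand-in for the generic Triton-generated kernel.
    return [v * 2 for v in x]

candidates = {"triton_fallback": triton_fallback}

def autotune(candidates, sample):
    timings = {}
    for name, kernel in candidates.items():
        start = time.perf_counter()
        kernel(sample)
        timings[name] = time.perf_counter() - start
    best = min(timings, key=timings.get)
    # Confidently reports a "best" kernel even from a set of one.
    print(f"autotuner: selected {best} out of {len(timings)} candidate(s)")
    return candidates[best]

kernel = autotune(candidates, list(range(1024)))
```

The report looks identical whether the tuner compared eight kernels or one, which is exactly why the fallback is invisible to the average user.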
Triton generates PTX, which is then compiled to the concrete SM architecture via the LLVM backend path. For SM12x this means: Triton can generate kernels that run — but Triton has no native knowledge of SM12x-specific MMA tile sizes, shared memory layout, or warp scheduling. The generated code is functionally correct but not optimal — possibly significantly worse than handwritten CUDA. For simple ops (elementwise, reduction, dequant kernels) this is acceptable — Triton can utilize SM12x reasonably well there. For matmul and attention it hits the same wall as FlashInfer: without explicit MMA instructions targeting SM12x, you stay suboptimal regardless of what the autotuner reports.
Up front: Weight quantization using the TQ/RQ solutions does not produce usable results. Therefore, back to the roots — Intel AutoRound. The iter=0 approach is built in as on-the-fly quantization. First measurements on GLM models show:
Calibrated vs RTN (on-the-fly):
| Precision | RTN (iters=0) | Calibrated (iters=200) | Δ |
|---|---|---|---|
| INT2 | 0.53 | 0.56 | +0.03 |
| INT3 | 0.80 | 0.82 | +0.02 |
| INT4 | — | 0.94 | (Reference) |
Calibration improves only marginally (+0.02–0.03). This is disappointing — at INT2, cos=0.56 is probably still insufficient for usable inference.
The good news: quant-at-load — quantize BF16 → INT3/INT4 on load — holds its own.
And yes, the numbers show that RTN (iters=0) is surprisingly close to calibrated (iters=1000):
| Metric | RTN (on-the-fly) | Calibrated (best) | Δ |
|---|---|---|---|
| INT3 Weight cos | 0.810 | 0.820 | +0.010 |
| INT3 Logit cos | 0.9995 | 0.9999 | +0.0004 |
| INT2 Weight cos | 0.562 | 0.566 | +0.004 |
The 1000 iterations of calibration (autoround best) yield only +0.01 weight cos for INT3 and virtually nothing for logit cos. This means RTN quant-at-load for INT3/INT4 is a valid approach, especially on DGX Spark where RAM is tight: it saves disk space, saves calibration time, and quality is nearly identical.
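For reference, the RTN (iters=0) path can be sketched as plain per-group round-to-nearest followed by a weight-cosine check against the original. The group size, the symmetric scheme, and all function names here are assumptions for illustration, not AutoRound's actual implementation:

```python
# RTN sketch: symmetric per-group round-to-nearest quantization of a
# weight vector, then cosine similarity of dequantized vs original.
import math
import random

def rtn_quantize(w, bits=3, group_size=8):
    qmax = 2 ** (bits - 1) - 1            # e.g. 3 for symmetric INT3
    deq = []
    for g in range(0, len(w), group_size):
        group = w[g:g + group_size]
        scale = max(abs(v) for v in group) / qmax or 1.0
        q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in group]
        deq.extend(qi * scale for qi in q)  # dequantize for comparison
    return deq

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

random.seed(0)
w = [random.gauss(0, 0.02) for _ in range(4096)]  # toy BF16-like weights
print(f"INT3 weight cos: {cosine(w, rtn_quantize(w, bits=3)):.3f}")
print(f"INT2 weight cos: {cosine(w, rtn_quantize(w, bits=2)):.3f}")
```

Even this toy version reproduces the qualitative pattern above: INT3 retains a much higher weight cosine than INT2, and no calibration loop is involved at all.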