Why Turboquant saves DGX twice

Digital_David · March 28, 2026, 3:23pm

So maybe a 300B BF16 model? @flash3 can you give a large one a try and post the results? Seems all of the results I’ve seen so far have been based on fairly small models to start with, definitely not anything maxing out the Spark.

paulsc.liu · March 28, 2026, 3:24pm

This llama-cpp TQ3 will work

./bin/llama-bench -m GLM-4.7-Flash-Q4_K_M.gguf -ctk f16,tq3_0 -p 512,4096 -n 128,512

model	size	params	backend	ngl	type_k	test	t/s
deepseek2 30B.A3B Q4_K - Medium	17.05 GiB	29.94 B	CUDA	99	f16	pp512	633.08 ± 2.40
deepseek2 30B.A3B Q4_K - Medium	17.05 GiB	29.94 B	CUDA	99	f16	pp4096	456.35 ± 0.87
deepseek2 30B.A3B Q4_K - Medium	17.05 GiB	29.94 B	CUDA	99	f16	tg128	28.72 ± 0.40
deepseek2 30B.A3B Q4_K - Medium	17.05 GiB	29.94 B	CUDA	99	f16	tg512	26.09 ± 0.03
deepseek2 30B.A3B Q4_K - Medium	17.05 GiB	29.94 B	CUDA	99	tq3_0	pp512	474.77 ± 2.05
deepseek2 30B.A3B Q4_K - Medium	17.05 GiB	29.94 B	CUDA	99	tq3_0	pp4096	304.49 ± 2.32
deepseek2 30B.A3B Q4_K - Medium	17.05 GiB	29.94 B	CUDA	99	tq3_0	tg128	7.68 ± 0.06
deepseek2 30B.A3B Q4_K - Medium	17.05 GiB	29.94 B	CUDA	99	tq3_0	tg512	6.59 ± 0.03
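To make the cost of tq3_0 in the table above easier to read, here is a small sketch that divides the f16 throughput by the tq3_0 throughput for each test. The numbers are copied from the table; the dictionary layout and the `slowdown` helper are only illustrative names.

```python
# Throughput in tokens/s, copied from the llama-bench table above.
results = {
    "pp512":  {"f16": 633.08, "tq3_0": 474.77},
    "pp4096": {"f16": 456.35, "tq3_0": 304.49},
    "tg128":  {"f16": 28.72,  "tq3_0": 7.68},
    "tg512":  {"f16": 26.09,  "tq3_0": 6.59},
}

def slowdown(test: str) -> float:
    """Return how many times slower tq3_0 is than f16 for one test."""
    row = results[test]
    return row["f16"] / row["tq3_0"]

for test in results:
    print(f"{test:>7}: tq3_0 is {slowdown(test):.2f}x slower than f16")
```

In this run, prompt processing is roughly 1.3x to 1.5x slower with tq3_0, while token generation is roughly 3.7x to 4x slower.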

paulsc.liu · March 28, 2026, 3:29pm

Some people claimed QJL actually hurts performance in practice.

paulsc.liu · March 28, 2026, 3:40pm

Near-optimal weight quantization with on-the-fly dequantization for LLM inference.

trystan1 · March 28, 2026, 7:33pm

My guess, based on what NVIDIA published, is that they have a version using nvCOMP (not sure which algorithm).

nvCOMP | NVIDIA Developer

mikee.gwu · March 28, 2026, 11:18pm

So this is obviously interesting, but is this academic work, or can this type of quant now be applied to existing models so that they need less memory and memory movement for inference?

In other words, is the idea that you apply this TQ to, say, nemotron-3s FP16, quantize it down, and then run inference with vLLM and get better performance because of the x-times memory compression of the KV cache?

How is this approach different from the TurboQ/RotorQ functionality that vLLM is working on? Is it just so that vLLM can do in-line quant at load/inference time and bypass the 1-time quantization step?
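As a rough illustration of what "x-times memory compression of the KV" means for context length, here is a back-of-the-envelope sketch. The model dimensions and the 16 GiB budget are made up for the example, it assumes both K and V are compressed, and it ignores per-group scales and other metadata that a real scheme would also store.

```python
# Hypothetical model dimensions, chosen only for illustration.
NUM_LAYERS = 48
NUM_KV_HEADS = 8
HEAD_DIM = 128

def kv_bytes_per_token(bits_per_value: int) -> int:
    """Bytes of KV cache needed for one token at a given bit width."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    values = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM
    return values * bits_per_value // 8

def max_context_tokens(budget_bytes: int, bits_per_value: int) -> int:
    """Number of tokens whose KV cache fits into the given memory budget."""
    return budget_bytes // kv_bytes_per_token(bits_per_value)

budget = 16 * 1024**3  # assume 16 GiB is left over for the KV cache
for bits in (16, 8, 4, 3):
    print(f"{bits:>2}-bit: {kv_bytes_per_token(bits)} bytes/token, "
          f"{max_context_tokens(budget, bits)} tokens fit")
```

With these made-up dimensions, 16-bit values cost 192 KiB per token, and going to 4 bits lets about four times as many tokens fit into the same budget.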

jwarner · March 29, 2026, 1:54am

Looking into my crystal ball:

We’ll get TQ or another polar-math cache first: longer contexts or bigger models squeezed in.
We will investigate TQ or similar runtime quants of model weights.
Top-to-bottom quants for both model and weights will arrive. They will verifiably reduce the memory footprint and may actually decrease performance a bit at first, but will be accepted because they fit bigger models at higher quality into RAM. The performance regression will be mitigated by the lower memory bandwidth required.
Optimized kernels and math for these rotary operations will arrive piecemeal and need to permeate the various systems and projects, which will take time, but they will ultimately realize the throughput potential, with the goal of saturating memory bandwidth.

We are ideally positioned to take advantage of this on Spark, with a big but slower RAM pool. The smaller tensor core memory may end up working out better with compressed polar math.

trystan1 · March 29, 2026, 2:03am

I’d tend to agree. I think the prefill/compute performance on the Spark is amazing when combined with the amount of RAM for the price. The Achilles heel, memory bandwidth, is something that can (with time) be circumvented with software. In the last year alone, inference has come a long way in terms of quantization quality/accuracy.

flash3 · March 29, 2026, 9:14am

Look, there was never a free lunch. And even if it looks like there was, karma jumped in later — bigger, more painful.

So what do we have here? A new approach: quantizing on load takes more time than just loading, but you save all the pre-quant file versions and distribution overhead. And if you’ve chosen ONE model that fits your use case, you can live with a slightly longer load time — how often do you actually reload?

Then we have the memory-bound issue, mainly on the minimalist datacenter box that has recently hit the market. Compute saves memory. And this is true — we have proven it. Whether this holds on RTX and consumer GPUs is hard to say at this point.

And then there is competition in model architecture: quantization is one part, pruning/REAP is another. So all of you get to become an architect who can shape the model itself. TQ could be a very good asset here, because it works almost independently — and “almost” is exactly where MultiQuant comes in.

The biggest tragedy is that all the commercially driven projects — vLLM, NVIDIA, FlashInfer — are obviously not interested in this hyped datacenter box.

I developed this because I needed it, because of the limitations I was facing. This is not about pure Python one-liner implementations for 0.8B models for a social media post — it’s a CUDA kernel-driven, high-performance MultiQuant framework with support for tensor and expert parallelism and all the relevant models above 350B parameters.

flash3 · March 29, 2026, 9:22am

You can clone the repo, build and start it yourself.

[GitHub - flash7777/vllm at multiquant · GitHub]

TQ runs like hell (see measurements above). RQ is not quite there yet — minor issues. The math is fine, but the Clifford rotation is not CUDA graph-safe, so I’m testing some ideas to fix that. (Only worth pursuing if it’s fast enough.)

On-the-fly weight quantization works too.

Check the start_multiquant script — it should be self-explanatory.

trystan1 · March 29, 2026, 4:48pm

Pull requests · vllm-project/vllm

Slight decode hit, offset by 4x the context.

flash3 · March 29, 2026, 5:10pm

2-bit is sporty. Token rate sinks a bit on RTX, but on DGX it rises (compute vs. memory).

trystan1 · March 29, 2026, 5:14pm

That PR's Triton kernel crashed for me on Spark with "Fault is of type FAULT_PTE ACCESS_TYPE_VIRT_READ".

Did you get it working?

flash3 · March 29, 2026, 5:19pm

Sorry, but the problem with merging something into vLLM and/or into forks of vLLM … that IS the problem. I spent 2 hours getting TurboQuant into PyTorch and 1 day on CUDA, but 3 days fixing all the CUDA graph issues, FlashInfer ghost calls, and all the other surprises. And I wasn’t surprised when I saw ‘This PR made extensive changes, which may not get reviewed properly as review is very hard.’ in the comments of the pull request you posted.

trystan1 · March 29, 2026, 5:23pm

Ah, I see you were talking about your fork not the PR. My mistake.

flash3 · March 29, 2026, 5:26pm

flash3 · March 29, 2026, 5:32pm

By the way, Triton may crash in many cases because of this shared memory (smem) size problem. I don't know if it is solved in the latest vLLM.

trystan1 · March 29, 2026, 5:35pm

Same with flashinfer :(

So many blockers everywhere I go.

Every move I make, every breath I take, 99KB crashing me.

flash3 · March 29, 2026, 5:40pm

look in vllm-<whatever>/build/moe-configs/ — in the specific file you’ll find num_stages=4. That’s too much. I don’t know why this is specified. Any default would have been understandable, but this is broken by spec. OMG.

OK, it’s model-dependent. For a 0.8B model you can just roll the dice on your config — maybe 6 or 8. But the top “league” of models in this minimalistic cosmos… gets broken.
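Why a high num_stages can blow the shared memory budget can be sketched with a rough first-order estimate: a software-pipelined tiled matmul keeps roughly one tile of each input in shared memory per stage. The tile sizes below are hypothetical, the 99KB limit is the figure quoted in this thread, and the real usage depends on the compiler and the kernel, so treat this as an approximation rather than how Triton actually accounts for memory.

```python
SMEM_LIMIT_BYTES = 99 * 1024  # the 99KB per-block limit mentioned in this thread

def smem_bytes(block_m, block_n, block_k, elem_bytes, num_stages):
    """Rough shared-memory estimate: one A tile and one B tile per stage."""
    per_stage = (block_m * block_k + block_k * block_n) * elem_bytes
    return num_stages * per_stage

def max_num_stages(block_m, block_n, block_k, elem_bytes, limit=SMEM_LIMIT_BYTES):
    """Largest num_stages whose estimated footprint still fits the limit."""
    stages = 0
    while smem_bytes(block_m, block_n, block_k, elem_bytes, stages + 1) <= limit:
        stages += 1
    return stages

# Hypothetical MoE tile config with BF16 inputs (2 bytes per element).
tile = dict(block_m=64, block_n=128, block_k=128, elem_bytes=2)
for stages in (2, 3, 4):
    need = smem_bytes(num_stages=stages, **tile)
    verdict = "fits" if need <= SMEM_LIMIT_BYTES else "too large"
    print(f"num_stages={stages}: ~{need} bytes ({verdict})")
print("max num_stages for this tile:", max_num_stages(**tile))
```

Under this estimate, a config that is fine on a GPU with more shared memory per block can fail at the same num_stages here, and lowering num_stages or the block sizes is the usual way out.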

flash3 · March 30, 2026, 2:56pm

Some findings I’d like to share. Running GLM 4.7 leads to some problems: with D=256 (BF16 or FP8), Triton fails (now fixed).

Also, the group size matters a lot for reconstruction quality, and the improvement can be dramatic:

D=256 TQ3: cos 0.78 → 0.96 with gs=64
D=256 TQ4: cos 0.82 → 0.99 with gs=32
D=128 TQ3: cos 0.81 → 0.96 with gs=32

Because this is a purely statistical reconstruction measurement, individual outliers vanish in the average, but that does not make them irrelevant.
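The effect of group size on reconstruction cosine can be reproduced in spirit with a toy experiment. Note the assumptions: this is plain symmetric absmax quantization with one scale per group, not TurboQuant (no rotation, no codebook), the data is synthetic Gaussian with one injected outlier per vector, and all function names are made up. The absolute numbers will not match the ones above; only the trend is the point.

```python
import math
import random

def quantize_groupwise(x, bits, group_size):
    """Symmetric absmax quantize + dequantize with one scale per group."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 3 for 3-bit signed values
    out = []
    for start in range(0, len(x), group_size):
        group = x[start:start + group_size]
        scale = max(abs(v) for v in group) / qmax or 1.0  # avoid zero scale
        out.extend(round(v / scale) * scale for v in group)
    return out

def cosine(a, b):
    """Cosine similarity between two equally long vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_cosine(bits, group_size, dim=256, trials=50, seed=0):
    """Average reconstruction cosine over synthetic vectors with one outlier."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[rng.randrange(dim)] = 20.0  # assumed outlier, far outside the bulk
        total += cosine(x, quantize_groupwise(x, bits, group_size))
    return total / trials

for gs in (256, 64, 32):
    print(f"3-bit, gs={gs}: mean cos = {mean_cosine(3, gs):.3f}")
```

With one scale for the whole vector, the outlier sets the scale and most ordinary values round to zero; smaller groups confine that damage to the group that contains the outlier.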
