So maybe a 300B BF16 model? @flash3 can you give a large one a try and post the results? Seems all of the results I’ve seen so far have been based on fairly small models to start with, definitely not anything maxing out the Spark.
This llama.cpp TQ3 build will work:
./bin/llama-bench -m GLM-4.7-Flash-Q4_K_M.gguf -ctk f16,tq3_0 -p 512,4096 -n 128,512
| model | size | params | backend | ngl | type_k | test | t/s |
|---|---|---|---|---|---|---|---|
| deepseek2 30B.A3B Q4_K - Medium | 17.05 GiB | 29.94 B | CUDA | 99 | f16 | pp512 | 633.08 ± 2.40 |
| deepseek2 30B.A3B Q4_K - Medium | 17.05 GiB | 29.94 B | CUDA | 99 | f16 | pp4096 | 456.35 ± 0.87 |
| deepseek2 30B.A3B Q4_K - Medium | 17.05 GiB | 29.94 B | CUDA | 99 | f16 | tg128 | 28.72 ± 0.40 |
| deepseek2 30B.A3B Q4_K - Medium | 17.05 GiB | 29.94 B | CUDA | 99 | f16 | tg512 | 26.09 ± 0.03 |
| deepseek2 30B.A3B Q4_K - Medium | 17.05 GiB | 29.94 B | CUDA | 99 | tq3_0 | pp512 | 474.77 ± 2.05 |
| deepseek2 30B.A3B Q4_K - Medium | 17.05 GiB | 29.94 B | CUDA | 99 | tq3_0 | pp4096 | 304.49 ± 2.32 |
| deepseek2 30B.A3B Q4_K - Medium | 17.05 GiB | 29.94 B | CUDA | 99 | tq3_0 | tg128 | 7.68 ± 0.06 |
| deepseek2 30B.A3B Q4_K - Medium | 17.05 GiB | 29.94 B | CUDA | 99 | tq3_0 | tg512 | 6.59 ± 0.03 |
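To put the table in relative terms, here is a small Python sketch that computes how much of the f16 throughput survives with a tq3_0 K cache. The numbers are copied from the table above; the dictionary names are only for illustration.

```python
# Throughput (t/s) copied from the llama-bench table above.
f16 = {"pp512": 633.08, "pp4096": 456.35, "tg128": 28.72, "tg512": 26.09}
tq3_0 = {"pp512": 474.77, "pp4096": 304.49, "tg128": 7.68, "tg512": 6.59}

# Fraction of f16 throughput retained when the K cache uses tq3_0.
ratios = {test: tq3_0[test] / f16[test] for test in f16}

for test, ratio in ratios.items():
    print(f"{test:>7}: {ratio:.2f}x of f16")
```

On this run, prefill keeps roughly 67 to 75% of the f16 throughput, while token generation drops to about a quarter.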
Some people claimed QJL actually hurts performance in practice.
Near-optimal weight quantization with on-the-fly dequantization for LLM inference.
My guess is that, based on what NVIDIA published, they have a version using nvcomp (not sure which algorithm).
So this is obviously interesting, but is it academic work, or can this type of quant now be applied to existing models so that they need less memory and less memory movement for inference?
In other words, is the idea that you apply this TQ to, say, nemotron-3s FP16, quant it down, and then run inference with vLLM and get better performance because of the x-times memory compression of the KV cache?
How is this approach different from the TurboQ/RotorQ functionality that vLLM is working on? Is it just so that vLLM can do in-line quant at load/inference time and bypass the 1-time quantization step?
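For a feel of what "x-times memory compression of the KV" means in bytes, here is a back-of-the-envelope sketch. The model shape is hypothetical and chosen only for illustration, it assumes a plain grouped-query attention cache layout (not MLA), and it ignores the per-group scales and other metadata a real quantized cache carries.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    """Approximate KV cache size for one sequence (keys plus values)."""
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len  # 2 = K and V
    return values * bits_per_value / 8

# Hypothetical model shape, chosen only for illustration.
shape = dict(n_layers=48, n_kv_heads=8, head_dim=128, seq_len=131072)

fp16 = kv_cache_bytes(**shape, bits_per_value=16)
q3 = kv_cache_bytes(**shape, bits_per_value=3)  # ignores scale/metadata overhead

print(f"FP16 cache:  {fp16 / 2**30:.1f} GiB")
print(f"3-bit cache: {q3 / 2**30:.1f} GiB ({fp16 / q3:.1f}x smaller)")
```

With these made-up dimensions the cache shrinks from 24 GiB to 4.5 GiB at 128k context; real savings will be somewhat smaller once scales are counted.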
Looking into my crystal ball:
- We’ll get TQ or other polar math cache first. Longer contexts or bigger models squeezed in.
- We will investigate TQ or similar runtime quants of model weights.
- Top-to-bottom quants for both cache and weights will arrive, verifiably reducing memory footprint. They may actually decrease performance a bit at first, but that will be accepted in order to fit bigger models at higher quality into RAM. The performance regression will be partly mitigated by the lower memory bandwidth required.
- Optimized kernels and math for these rotary operations will arrive piecemeal and need to permeate the various systems and projects, which will take time, but they will ultimately realize the throughput potential, with the goal of saturating memory bandwidth.
We are ideally positioned to take advantage of this on Spark, with a big but slower RAM pool. The smaller tensor core memory may end up working out better with compressed polar math.
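The bandwidth argument can be sketched as a simple upper bound: if decode is purely memory-bound, tokens per second cannot exceed bandwidth divided by the bytes streamed per token. The numbers below are placeholders, not measurements of Spark or any other system, and the bound ignores the extra compute that dequantization costs.

```python
def decode_tokens_per_s(bytes_read_per_token, mem_bandwidth_bytes_s):
    """Upper bound on decode speed when each token must stream these bytes."""
    return mem_bandwidth_bytes_s / bytes_read_per_token

# Placeholder numbers, not measurements of any real system.
bandwidth = 250e9              # bytes/s
weights_bf16 = 60e9            # bytes streamed per token at 16 bits per weight
weights_4bit = weights_bf16 / 4

bound_bf16 = decode_tokens_per_s(weights_bf16, bandwidth)
bound_4bit = decode_tokens_per_s(weights_4bit, bandwidth)

print(f"BF16 bound:  {bound_bf16:.1f} tok/s")
print(f"4-bit bound: {bound_4bit:.1f} tok/s ({bound_4bit / bound_bf16:.0f}x)")
```

Whether a real kernel gets anywhere near this bound depends on how cheap the dequantization is, which is exactly the part that needs the optimized kernels.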
I’d tend to agree. I think the prefill/compute performance on the Spark is amazing when combined with the amount of RAM for the price. The Achilles' heel, memory bandwidth, is something that can (with time) be circumvented in software. In the last year alone, inference has come a long way in terms of quantization quality and accuracy.
Look, there was never a free lunch. And even if it looks like there was, karma jumped in later — bigger, more painful.
So what do we have here? A new approach: quantizing on load takes more time than just loading, but you save all the pre-quant file versions and distribution overhead. And if you’ve chosen ONE model that fits your use case, you can live with a slightly longer load time — how often do you actually reload?
Then we have the memory-bound issue, mainly on the minimalist datacenter box that has recently hit the market. Compute saves memory. And this is true — we have proven it. Whether this holds on RTX and consumer GPUs is hard to say at this point.
And then there is competition in model architecture: quantization is one part, pruning/REAP is another. So all of you get to become an architect who can shape the model itself. TQ could be a very good asset here, because it works almost independently — and “almost” is exactly where MultiQuant comes in.
The biggest tragedy is that all the commercially driven projects — vLLM, NVIDIA, FlashInfer — are obviously not interested in this hyped datacenter box.
I developed this because I needed it, because of the limitations I was facing. This is not about pure Python one-liner implementations for 0.8B models for a social media post — it’s a CUDA kernel-driven, high-performance MultiQuant framework with support for tensor and expert parallelism and all the relevant models above 350B parameters.
You can clone the repo, build and start it yourself.
[GitHub - flash7777/vllm at multiquant · GitHub]
TQ runs like hell (see measurements above). RQ is not quite there yet — minor issues. The math is fine, but the Clifford rotation is not CUDA graph-safe, so I’m testing some ideas to fix that. (Only worth pursuing if it’s fast enough.)
On-the-fly weight quantization works too.
Check the start_multiquant script — it should be self-explanatory.
Pull requests · vllm-project/vllm
Slight decode hit, offset by 4x the context.
2-bit is sporty. Token rate sinks a bit on RTX, but on DGX it rises (compute vs. memory).
That PR's Triton kernel crashed for me on Spark with `Fault is of type FAULT_PTE ACCESS_TYPE_VIRT_READ`.
Did you get it working?
Sorry, but the problems with merging something into vLLM and/or into forks of vLLM … that IS the problem. I spent 2 hours getting TurboQuant into PyTorch, 1 day on CUDA, but 3 days fixing all the CUDA graph issues, FlashInfer ghost calls, and all the other surprises. And I wasn’t surprised when I saw ‘This PR made extensive changes, which may not get reviewed properly as review is very hard.’ in the comments of the pull request you posted.
Ah, I see you were talking about your fork not the PR. My mistake.
By the way, Triton may crash in many cases because of this shared-memory size problem. I don't know if it is solved in the latest vLLM.
Same with flashinfer :(
So many blockers everywhere I go.
Every move I make, every breath I take, 99KB crashing me.
Look in vllm-<whatever>/build/moe-configs/. In the specific file you’ll find num_stages=4. That’s too much. I don’t know why this is specified. Any default would have been understandable, but this is broken by spec. OMG.
OK, it’s model-dependent. For a 0.8B model you can just roll the dice on your config — maybe 6 or 8. But the top “league” of models in this minimalistic cosmos… gets broken.
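A rough way to see why num_stages=4 can blow past the 99 KB limit: in a software-pipelined matmul, each stage buffers one tile of each operand in shared memory, so the footprint scales linearly with the stage count. The sketch below is only an estimate under that assumption. The tile sizes are hypothetical, and the real allocation depends on the compiler and the kernel.

```python
SMEM_LIMIT = 99 * 1024  # bytes; the 99 KB figure mentioned in this thread

def est_smem_bytes(block_m, block_n, block_k, num_stages, elem_bytes=2):
    """Rough shared-memory estimate for a software-pipelined matmul tile.

    Each pipeline stage buffers one A tile (block_m x block_k) and one
    B tile (block_k x block_n). Real compilers may allocate differently.
    """
    per_stage = (block_m * block_k + block_k * block_n) * elem_bytes
    return num_stages * per_stage

# Hypothetical tile sizes (128x128x64, 2-byte elements), for illustration only.
fits = {}
for stages in (2, 3, 4):
    need = est_smem_bytes(128, 128, 64, stages)
    fits[stages] = need <= SMEM_LIMIT
    print(f"num_stages={stages}: {need} bytes, {'fits' if fits[stages] else 'too large'}")
```

With these example tiles, three stages just fit (98304 bytes) and four do not (131072 bytes), which matches the kind of failure described above.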
Some findings I’d like to share. Running GLM 4.7 leads to some problems: with D=256 (BF16 or FP8), Triton fails (fixed).
Also, the group size matters a lot for reconstruction quality; the improvement can be dramatic:
- D=256 TQ3: cos 0.78 → 0.96 with gs=64
- D=256 TQ4: cos 0.82 → 0.99 with gs=32
- D=128 TQ3: cos 0.81 → 0.96 with gs=32
Because cosine similarity is a purely statistical reconstruction measure, individual outliers vanish in the aggregate, but that does not make them irrelevant.
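To illustrate both points (group size and outliers hiding in the aggregate), here is a minimal sketch using plain symmetric absmax quantization with one scale per group. This is a generic stand-in and NOT the TQ kernel from the fork, so the numbers will not match the ones above. The vector, the outlier value, and the function names are assumptions for illustration.

```python
import math
import random

def quantize_groups(x, bits, group_size):
    """Symmetric absmax quantization with one scale per group (generic stand-in)."""
    levels = 2 ** (bits - 1) - 1  # e.g. 3 positive levels for 3-bit symmetric
    out = []
    for start in range(0, len(x), group_size):
        group = x[start:start + group_size]
        scale = max(abs(v) for v in group) / levels or 1.0  # avoid a zero scale
        out.extend(round(v / scale) * scale for v in group)
    return out

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(256)]  # one synthetic D=256 vector
x[7] = 25.0  # a single large outlier that inflates its group's scale

results = {}
for gs in (256, 64, 32):
    xq = quantize_groups(x, bits=3, group_size=gs)
    max_err = max(abs(p - q) for p, q in zip(x, xq))
    results[gs] = (cosine(x, xq), max_err)
    print(f"gs={gs:3d}  cos={results[gs][0]:.3f}  max abs error={max_err:.3f}")
```

Smaller groups confine the damage from the outlier's scale to fewer values, so the cosine climbs, while the worst-case per-element error inside the outlier's own group stays large. That is the part a cosine score alone does not show.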
