Well, hello there!

Imagine you were an old hand around the forum but have been away for a while.

What’s interesting these days?

Let’s re-spark some enthusiasm!

g’day mate and welcome back :),

For interesting areas to explore, my 2 picks would be the b12x NVFP4 kernels (lukealonso/b12x on GitHub) and DFlash speculative diffusion decoding (z-lab/dflash on GitHub, "DFlash: Block Diffusion for Flash Speculative Decoding").

Interesting! What draws you to b12x and dflash? What do they do?

Hmm… I think I can answer my own question about DFlash after reviewing the link! 415.7 tokens per second (plus gpt-oss-120b models prepared, tickling a particular interest of mine!)

For b12x, SM120 users report pretty significant throughput improvements (30%-ish) when using its NVFP4 kernels instead of FlashInfer (from vllm-project/vllm pull request #39634 by voipmonitor, "[SM120][B12X] Add b12x NVFP4 MoE and dense backends").

I’d really like an updated custom MXFP4 build for eugr’s repo - for two reasons.

  1. DFlash, as mentioned before, should be incredible for throughput.
  2. Nvidia released GPT-OSS Puzzle a few weeks back which is a pruned and further tuned version of GPT-OSS-120B down to 88B. The model architecture is a little unusual and I’d like to be able to compare it in the same harness to the original 120B.

Welcome back! Yes, the number one request is to fix the broken MXFP4 build - right now it only works solo, not in a cluster configuration.

flash3 /flash7777 is doing some magic

Neat! Where can I read the topline summary of what’s going on in the repo?

is flash related to FlashInfer?

I’m a little embarrassed that I left it in a bad state!

It was my intention not to have my experimentation damage it, but I think I did a force push somewhere and it broke the repo.

I’ll try and put some time aside this weekend to retire the experiments and polish it up some.

Thank you, friend!

Sorry about leaving it in a bad state! Like I mentioned above, it wasn’t my intention, just an armchair dev making rookie mistakes ;).

I’ll try and put some time aside to polish it up!

cc @flash3

The repos are usually not very outsider-friendly :D Maybe flash can help and explain a bit here …

Nice to have you back!

…change to the multiquant branch.

MultiQuant is under development, so HEAD may break. Use a tagged release as a stable starting point. DevOps is done by Claude Code, so do not take the GitHub commits too seriously.

build.sh builds vLLM with MultiQuant including RIY.
start.multiquant.sh is self-explanatory.

Current state:

  • RIY — working (if you only want to prune, use the riy branch instead)
  • TQ/RQ — works for KV and weights (no MoE); use tq2w, tq3w, tq4w
  • XFP — latest hot stuff, still in research, but works for dense and MoE models; vision support added yesterday

The idea behind MultiQuant is to quantize every component (so-called classes) of any BF16 model at load time. Combined with RIY, you can prune according to your workload.
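To make the "per class, at load time" idea a bit more concrete, here is a minimal Python sketch of what such a plan could look like. It is not taken from the MultiQuant code: the glob patterns, the parameter names, the `bf16` fallback and the helper names `pick_scheme` and `plan` are all assumptions for illustration. Only the `tq3w`/`tq4w` labels come from the post above.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: glob pattern on a parameter name -> scheme label.
# The patterns and the "bf16" fallback are assumptions, not MultiQuant's config.
RULES = [
    ("*.self_attn.*", "tq4w"),  # attention projections
    ("*.mlp.*", "tq3w"),        # dense feed-forward weights
    ("*embed*", "bf16"),        # leave embeddings unquantized
]


def pick_scheme(param_name, rules=RULES, default="bf16"):
    """Return the scheme of the first rule whose pattern matches the name."""
    for pattern, scheme in rules:
        if fnmatchcase(param_name, pattern):
            return scheme
    return default


def plan(param_names, rules=RULES):
    """Build a name -> scheme plan that a loader could apply tensor by tensor."""
    return {name: pick_scheme(name, rules) for name in param_names}


if __name__ == "__main__":
    names = [
        "model.embed_tokens.weight",
        "model.layers.0.self_attn.q_proj.weight",
        "model.layers.0.mlp.down_proj.weight",
        "lm_head.weight",
    ]
    for name, scheme in plan(names).items():
        print(f"{scheme:>5}  {name}")
```

The first matching rule wins, so more specific patterns should be listed first. Anything that matches no rule stays in its original precision.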

It is for sm12x, so GB10 and RTX PRO will run like hell.

sorry for coming from the position of an outsider who hasn’t been following along…

but what does multiquant mean? What are the definitions of RIY, TQ/RQ, tq2w, tq3w, tq4w, XFP?

Google featured a paper (TurboQuant) which flew under everyone's radar. It uses rotational preconditioning of feature vectors in higher-dimensional space to eliminate the need to scale them. It sparsifies weights and thus substantially reduces the memory requirements while remaining lossless.
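For anyone wondering why rotating before quantizing helps at all, here is a small pure-Python sketch. It is a simplified illustration, not the TurboQuant algorithm: it assumes a random sign flip followed by a Walsh-Hadamard transform as the rotation, and a plain uniform quantizer with one shared scale. The function names are made up for this post.

```python
import math
import random


def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two.

    With the 1/sqrt(n) normalization the transform is its own inverse.
    """
    out = list(v)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [x / norm for x in out]


def quantize(v, bits):
    """Symmetric uniform quantizer with a single scale taken from the max magnitude."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in v) / levels
    return [max(-levels, min(levels, round(x / scale))) * scale for x in v]


def squared_errors(x, bits, seed=0):
    """Return (plain, rotated) squared reconstruction error for vector x."""
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in x]

    # Quantize the raw vector directly.
    plain = quantize(x, bits)
    err_plain = sum((a - b) ** 2 for a, b in zip(x, plain))

    # Rotate (sign flip + WHT), quantize, then undo the rotation.
    rotated = fwht([s * a for s, a in zip(signs, x)])
    dequantized = fwht(quantize(rotated, bits))
    restored = [s * a for s, a in zip(signs, dequantized)]
    err_rot = sum((a - b) ** 2 for a, b in zip(x, restored))
    return err_plain, err_rot


if __name__ == "__main__":
    # Sixteen entries of +/-1 with one large outlier.
    x = [1.0 if i % 2 == 0 else -1.0 for i in range(16)]
    x[5] = 20.0
    plain, rot = squared_errors(x, bits=4)
    print(f"plain: {plain:.3f}  rotated: {rot:.3f}")
```

In the example vector the single outlier sets the scale, so the plain 4-bit quantizer rounds every small entry to zero. The rotation spreads the outlier across all coordinates, which shrinks the scale and gives a smaller squared error for the same bit width.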

This is probably why proprietary Gemini has had 4-8x the context length of competitors for years.

Early days, but the potential is high for leveraging memory better and also increasing throughput on bandwidth constrained systems. But there will be more optimizations necessary.

RotorQuant, MQ, and the other acronyms took off as the community started to explore the ideas around TurboQuant.

RIY: Reap It Yourself?
TQ3w: TurboQuant, 3-bit, WHT
XFP: not yet released - cos 1 dependent bit-packing concept. Always correct.