MJPansa/MiniMax-M2.7-REAP-172B-A10B-AutoRound-W4A16

Hey guys,

Over the past few days I was eagerly waiting for M2.7 REAPs to come out with quants suitable for a single DGX Spark, but I couldn't find anything, and since I had holidays from Monday to Wednesday I created one myself. Unfortunately I only have a single Spark plus a 256GB Threadripper with 2x 3090s, so my setup is pretty bad for calibrating with large sample sizes and sample lengths.
Running evals also takes forever. I ran GPQA Diamond for over 15 hours just to discover I had been too stingy with the max tokens at 16k, and 30% of samples didn't finish reasoning within that budget. It still got 60%, so the general intelligence seems to be there. I will run some more evals on external compute over the weekend; I will most likely need to rent some H200s.

Anyways, I thought I'd still share it here in case someone wants to give it a go as well :)
Speed is basically the same as the 122B A10B Qwen model with AutoRound.

There is definitely room for improvement once the new KV cache quants land in vLLM. On a single Spark with FP8 you can get around 100k context, I guess. I haven't pushed it all that far since my agents are also sharing a portion of the RAM most of the time.
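To give a rough sense of why an FP8 KV cache buys so much context, here is a back-of-the-envelope sketch. Note that the layer/head counts below are illustrative placeholders, not the actual MiniMax M2 config:

```python
# Back-of-the-envelope KV-cache size per token:
# 2 tensors (K and V) x layers x kv_heads x head_dim x bytes per element.
# NOTE: the default layers/kv_heads/head_dim here are placeholders, NOT
# the real MiniMax M2 config -- substitute the values from config.json.
def kv_cache_gib(context_len, layers=62, kv_heads=8, head_dim=128, dtype_bytes=1):
    per_token_bytes = 2 * layers * kv_heads * head_dim * dtype_bytes
    return context_len * per_token_bytes / 2**30

print(f"fp8,  100k ctx: {kv_cache_gib(100_000):.1f} GiB")                 # 1 byte/elem
print(f"bf16, 100k ctx: {kv_cache_gib(100_000, dtype_bytes=2):.1f} GiB")  # 2 bytes/elem
```

Whatever the exact config, FP8 halves the KV footprint versus BF16, which is where the extra context headroom comes from.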

| depth | prefill tok/s | decode tok/s | TTFT (ms) |
|---|---|---|---|
| 0 | 2469.3 ± 13.3 | 29.28 ± 0.05 | 864.5 |
| 4096 | 2089.9 ± 12.5 | 27.73 ± 0.05 | 2784.8 |
| 8192 | 1890.3 ± 5.2 | 26.28 ± 0.05 | 5062.3 |
| 16384 | 1601.1 ± 6.5 | 23.88 ± 0.05 | 10647.7 |
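As a quick sanity check on these numbers (my own sketch, not from the benchmark harness): TTFT should be roughly a fixed per-request overhead (the depth-0 TTFT of ~864 ms) plus depth divided by the measured prefill rate.

```python
# Check TTFT ~= base_overhead + depth / prefill_rate against the table above.
rows = [  # (depth, prefill tok/s, TTFT ms)
    (4096, 2089.9, 2784.8),
    (8192, 1890.3, 5062.3),
    (16384, 1601.1, 10647.7),
]
base_ttft_ms = 864.5  # depth-0 TTFT: fixed per-request overhead

for depth, prefill_tps, ttft_ms in rows:
    predicted_ms = base_ttft_ms + depth / prefill_tps * 1000
    rel_err = abs(predicted_ms - ttft_ms) / ttft_ms
    print(f"depth={depth:6d}  measured={ttft_ms:8.1f} ms  "
          f"predicted={predicted_ms:8.1f} ms  err={rel_err:.1%}")
```

All three rows come out within about 5%, so the TTFT column is consistent with the prefill throughput column.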

Happy for any feedback, this is just a first draft. I will need to pick some better and bigger datasets for REAP and quantisation calibration. I just wanted to validate everything end to end first before renting more compute.


TQ3/TQ4 buys you a little more context; it worked when I tested it for 12 hours.
So it's possible :)


TurboQuant-quantised LLM models are appearing on Hugging Face :-).

The model can be run via vLLM.
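For anyone wanting to try it, a hedged sketch of a launch command using standard vLLM flags (the context length and memory utilisation values are illustrative, tune them for your Spark):

```shell
# Sketch only: model name from the thread title; flags are standard
# vLLM engine args, but the specific values are illustrative guesses.
vllm serve MJPansa/MiniMax-M2.7-REAP-172B-A10B-AutoRound-W4A16 \
  --max-model-len 65536 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90
```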


This is cool!! Is there a way to run this, @eugr?

In the meantime I will give your quant a try, @mjpansa.


On a single Spark? Any scripts or variables?

You can give it a try with the most recent build - some of the Turboquant-related PRs have been merged into vLLM recently.
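Since the relevant PRs were only merged recently, you would need a build newer than the last stable release. A sketch, assuming you want vLLM's published nightly wheels rather than building from source:

```shell
# Install the latest nightly build of vLLM (sketch; see the vLLM
# installation docs for the currently recommended command).
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
```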


My Spark is still running AIME benchmarks; I will try TurboQuant tomorrow.

Not yet.

The main repo now has some TQ options for KV cache, but this requires the whole model too.

You need turboquant-vllm for that one, for now.

It might be worth looking into setting that up for testing. This isn’t the only model that could fit on 1 Spark much more efficiently under such a framework.