Hey guys,
Over the past days I was eagerly waiting for an M2.7 REAP to come out with quants suitable for a single DGX Spark, but I couldn't find anything, and since I also had holidays from Monday to Wednesday, I created one myself. Unfortunately I only have a single Spark plus a 256GB Threadripper with 2x 3090s, so my setup is pretty shitty for calibrating with huge sample sizes and sample lengths.
Also, running evals takes forever: I ran GPQA Diamond for over 15 hours just to discover I had been too stingy with max tokens at 16k, and 30% of samples didn't finish reasoning within that budget. It still got 60%, so the general intelligence seems to be there. Will run some more evals on external compute over the weekend; will most likely need to rent some H200s.
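If anyone wants to sanity-check the truncation rate before burning 15 hours: a quick way is to count `finish_reason == "length"` in the saved responses (the OpenAI-style field that means the token budget ran out before a stop token). A minimal sketch, assuming your eval harness logs results as dicts with that field:

```python
def truncation_ratio(results):
    """Fraction of samples that exhausted the max-token budget.

    Assumes OpenAI-style responses where finish_reason == "length"
    means the generation was cut off (vs. "stop" for a clean finish).
    """
    if not results:
        return 0.0
    truncated = sum(1 for r in results if r.get("finish_reason") == "length")
    return truncated / len(results)

# Example: 3 of 10 samples ran out of budget
sample = [{"finish_reason": "length"}] * 3 + [{"finish_reason": "stop"}] * 7
print(truncation_ratio(sample))  # 0.3
```

Running this on a small pilot batch first would have told me to bump the budget before committing to the full run.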
Anyways, I thought I'd still share it here in case someone wants to give it a go as well :)
Speed is basically the same as the 122B A10B Qwen model with AutoRound.
There is definitely room for improvement with the new KV cache quants landing in vLLM. On a single Spark with fp8 KV cache you can get around 100k context, I guess. I haven't pushed it all too far since my agents are also sharing some portion of the RAM most of the time.
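For anyone wanting to sanity-check that ~100k figure on their own setup: KV cache size is just 2 (K and V) x layers x KV heads x head dim x context x bytes per element. A back-of-the-envelope sketch with placeholder dims (NOT the actual M2.7 config, plug in the real values from the model's config.json):

```python
def kv_cache_gib(num_layers, num_kv_heads, head_dim, context_len, bytes_per_elem):
    """KV cache size in GiB: 2 tensors (K and V) per layer,
    each of shape [num_kv_heads, context_len, head_dim]."""
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1024**3

# Hypothetical dims for illustration; fp8 KV = 1 byte/elem, fp16 = 2 bytes/elem
print(kv_cache_gib(62, 8, 128, 100_000, 1))  # fp8 at 100k context
print(kv_cache_gib(62, 8, 128, 100_000, 2))  # fp16 at 100k context, twice as large
```

Whatever is left of the Spark's unified memory after weights and my agents' RAM share is what bounds the usable context.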
| depth (tokens) | prefill tok/s | decode tok/s | TTFT (ms) |
|---|---|---|---|
| 0 | 2469.3 ± 13.3 | 29.28 ± 0.05 | 864.5 |
| 4096 | 2089.9 ± 12.5 | 27.73 ± 0.05 | 2784.8 |
| 8192 | 1890.3 ± 5.2 | 26.28 ± 0.05 | 5062.3 |
| 16384 | 1601.1 ± 6.5 | 23.88 ± 0.05 | 10647.7 |
Happy for any feedback; this is just a first draft. Will need to pick some better and bigger datasets for REAP and quantisation calibration. I just wanted to validate everything end to end first before renting more compute.