mjpansa
February 21, 2026, 11:32am
1
Yesterday a REAP version of MiniMax 2.5 showed up, already quantised to NVFP4. I ran benchy on it:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| MiniMax-M2.5 | pp2048 | 3342.54 ± 141.85 | | 720.56 ± 26.78 | 613.84 ± 26.78 | 720.64 ± 26.81 |
| MiniMax-M2.5 | tg32 | 16.71 ± 0.24 | 17.00 ± 0.00 | | | |
| MiniMax-M2.5 | ctx_pp @ d4096 | 2994.70 ± 4.09 | | 1474.47 ± 1.87 | 1367.75 ± 1.87 | 1474.53 ± 1.86 |
| MiniMax-M2.5 | ctx_tg @ d4096 | 16.49 ± 0.03 | 17.00 ± 0.00 | | | |
| MiniMax-M2.5 | pp2048 @ d4096 | 2383.55 ± 23.95 | | 966.03 ± 8.69 | 859.31 ± 8.69 | 966.08 ± 8.70 |
| MiniMax-M2.5 | tg32 @ d4096 | 16.27 ± 0.03 | 17.00 ± 0.00 | | | |
| MiniMax-M2.5 | ctx_pp @ d8192 | 2554.64 ± 3.07 | | 3313.43 ± 3.86 | 3206.72 ± 3.86 | 3313.50 ± 3.86 |
| MiniMax-M2.5 | ctx_tg @ d8192 | 15.85 ± 0.02 | 16.33 ± 0.47 | | | |
| MiniMax-M2.5 | pp2048 @ d8192 | 1929.08 ± 34.21 | | 1168.69 ± 18.78 | 1061.98 ± 18.78 | 1168.77 ± 18.78 |
| MiniMax-M2.5 | tg32 @ d8192 | 15.66 ± 0.02 | 16.00 ± 0.00 | | | |
| MiniMax-M2.5 | ctx_pp @ d16384 | 2073.85 ± 1.07 | | 8006.99 ± 4.06 | 7900.28 ± 4.06 | 8007.06 ± 4.06 |
| MiniMax-M2.5 | ctx_tg @ d16384 | 14.55 ± 0.26 | 15.33 ± 0.47 | | | |
| MiniMax-M2.5 | pp2048 @ d16384 | 1463.58 ± 2.90 | | 1506.03 ± 2.77 | 1399.32 ± 2.77 | 1506.10 ± 2.78 |
| MiniMax-M2.5 | tg32 @ d16384 | 14.30 ± 0.20 | 15.00 ± 0.00 | | | |
| MiniMax-M2.5 | ctx_pp @ d32768 | 1519.62 ± 0.70 | | 21669.84 ± 9.96 | 21563.12 ± 9.96 | 21669.91 ± 9.96 |
| MiniMax-M2.5 | ctx_tg @ d32768 | 12.95 ± 0.02 | 13.33 ± 0.47 | | | |
| MiniMax-M2.5 | pp2048 @ d32768 | 953.78 ± 0.49 | | 2253.96 ± 1.10 | 2147.24 ± 1.10 | 2254.04 ± 1.10 |
| MiniMax-M2.5 | tg32 @ d32768 | 12.84 ± 0.02 | 13.00 ± 0.00 | | | |
| MiniMax-M2.5 | ctx_pp @ d65535 | 1000.55 ± 0.63 | | 65605.61 ± 41.25 | 65498.89 ± 41.25 | 65605.67 ± 41.25 |
| MiniMax-M2.5 | ctx_tg @ d65535 | 10.49 ± 0.01 | 11.00 ± 0.00 | | | |
| MiniMax-M2.5 | pp2048 @ d65535 | 571.21 ± 0.27 | | 3692.10 ± 1.68 | 3585.38 ± 1.68 | 3692.19 ± 1.68 |
| MiniMax-M2.5 | tg32 @ d65535 | 10.38 ± 0.02 | 11.00 ± 0.00 | | | |
I had to change the provided vLLM command slightly:
```bash
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0
export SAFETENSORS_FAST_GPU=1
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export NCCL_IB_DISABLE=1
export OMP_NUM_THREADS=8
python3 -m vllm.entrypoints.openai.api_server \
  --model lukealonso/MiniMax-M2.5-REAP-139B-A10B-NVFP4 \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name MiniMax-M2.5 \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 64 \
  --max-model-len 131072 \
  --disable-custom-all-reduce \
  --attention-config.use_trtllm_attention=0 \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think
```
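In case anyone wants to eyeball outputs, here is a minimal smoke test against the endpoint as configured above (port 8000, served name MiniMax-M2.5); the prompt and max_tokens are placeholders, not from my benchmark run:

```bash
# Minimal smoke test of the OpenAI-compatible endpoint started above.
# Prompt and sampling parameters are illustrative placeholders.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MiniMax-M2.5",
        "messages": [{"role": "user", "content": "Say hello and count to five."}],
        "max_tokens": 128
      }'
```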
I didn't really have any time the past few weeks, so this is still running on scitrera/dgx-spark-vllm:0.14.0-t5; not sure whether recent vLLM versions already fix some of the speed problems. Will test more tomorrow. I would guess 30 t/s, and 20 t/s at long context, should be possible?
Also, the customizations the M2 architecture brings look like they could benefit from additional fused kernels. I'll also want to look at the NVFP4 speedup post and see what that might bring on top.
Have you tried minimax_m2 as the reasoning parser instead of minimax_m2_append_think?
mjpansa
February 21, 2026, 3:01pm
3
I have to admit that I didn't check any outputs so far. Had to go to a wedding; will run some tests tomorrow. Happy about any input.
flash3
February 21, 2026, 7:06pm
4
I'm just curious: lobotomy successful, patient brain-dead? What orientation did the lobo-set (short for: lobotomizing dataset) have?
Generally speaking, it's a good sign if it can still make coffee afterwards… ;)
I tried this one last night, as well as a GLM4.7-Flash MTP NVFP4. The output was gibberish, but it's probably a skill issue on my end. Maybe I didn't use quite the right parameters 😅
jwarner
February 21, 2026, 8:29pm
6
This one is a little too strongly REAPed. The creator intended it for an RTX PRO 6000, so the target was a 96 GB memory budget; removing 40% is somewhat too much.
A 20-25% REAP would be better for the Spark.
Edit: an NVFP4 or AWQ quant of this one would be a great target: cerebras/MiniMax-M2.5-REAP-172B-A10B · Hugging Face
cosinus
February 25, 2026, 10:29am
7
An AWQ version of the 139B has landed, thanks to cyanwiki / captonn.
entrpi
February 27, 2026, 1:11am
8
I ran llama-benchy against the i1-IQ4_XS from https://hf.tst.eu/model#MiniMax-M2.5-REAP-139B-A10B-GGUF:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| MiniMax-M2.5-REAP-172B-A10B | pp2048 | 334.44 ± 160.18 | | 7516.62 ± 2683.80 | 7403.41 ± 2683.80 | 7517.01 ± 2684.01 |
| MiniMax-M2.5-REAP-172B-A10B | tg32 | 20.36 ± 5.99 | 22.33 ± 5.91 | | | |
| MiniMax-M2.5-REAP-172B-A10B | ctx_pp @ d4096 | 463.87 ± 51.61 | | 9048.49 ± 948.20 | 8935.28 ± 948.20 | 9048.52 ± 948.20 |
| MiniMax-M2.5-REAP-172B-A10B | ctx_tg @ d4096 | 18.24 ± 1.87 | 19.67 ± 2.05 | | | |
| MiniMax-M2.5-REAP-172B-A10B | pp2048 @ d4096 | 454.65 ± 25.72 | | 4632.24 ± 255.56 | 4519.03 ± 255.56 | 4632.28 ± 255.56 |
| MiniMax-M2.5-REAP-172B-A10B | tg32 @ d4096 | 17.67 ± 1.39 | 18.67 ± 1.70 | | | |
| MiniMax-M2.5-REAP-172B-A10B | ctx_pp @ d8192 | 337.76 ± 28.10 | | 24530.93 ± 1967.98 | 24417.72 ± 1967.98 | 24530.98 ± 1967.99 |
| MiniMax-M2.5-REAP-172B-A10B | ctx_tg @ d8192 | 13.12 ± 0.76 | 15.00 ± 0.82 | | | |
| MiniMax-M2.5-REAP-172B-A10B | pp2048 @ d8192 | 347.91 ± 16.29 | | 6012.66 ± 274.07 | 5899.45 ± 274.07 | 6012.69 ± 274.06 |
| MiniMax-M2.5-REAP-172B-A10B | tg32 @ d8192 | 13.22 ± 0.75 | 14.00 ± 0.82 | | | |
| MiniMax-M2.5-REAP-172B-A10B | ctx_pp @ d16384 | 263.29 ± 23.51 | | 62834.47 ± 5526.55 | 62721.26 ± 5526.55 | 62834.53 ± 5526.57 |
| MiniMax-M2.5-REAP-172B-A10B | ctx_tg @ d16384 | 9.61 ± 0.63 | 10.33 ± 0.47 | | | |
| MiniMax-M2.5-REAP-172B-A10B | pp2048 @ d16384 | 258.94 ± 17.57 | | 8058.99 ± 539.89 | 7945.78 ± 539.89 | 8059.04 ± 539.90 |
| MiniMax-M2.5-REAP-172B-A10B | tg32 @ d16384 | 9.28 ± 1.00 | 10.00 ± 0.82 | | | |
| MiniMax-M2.5-REAP-172B-A10B | ctx_pp @ d32768 | 191.41 ± 16.09 | | 172493.19 ± 14154.36 | 172379.98 ± 14154.36 | 172493.91 ± 14155.10 |
| MiniMax-M2.5-REAP-172B-A10B | ctx_tg @ d32768 | 6.66 ± 0.16 | 7.67 ± 0.47 | | | |
| MiniMax-M2.5-REAP-172B-A10B | pp2048 @ d32768 | 184.07 ± 6.47 | | 11253.80 ± 401.75 | 11140.59 ± 401.75 | 11253.86 ± 401.80 |
| MiniMax-M2.5-REAP-172B-A10B | tg32 @ d32768 | 6.76 ± 0.19 | 7.67 ± 0.47 | | | |
| MiniMax-M2.5-REAP-172B-A10B | ctx_pp @ d65535 | 182.88 ± 3.51 | | 358595.32 ± 6962.06 | 358482.10 ± 6962.06 | 358596.26 ± 6963.34 |
| MiniMax-M2.5-REAP-172B-A10B | ctx_tg @ d65535 | 6.58 ± 0.29 | 7.33 ± 0.47 | | | |
| MiniMax-M2.5-REAP-172B-A10B | pp2048 @ d65535 | 177.03 ± 8.37 | | 11707.56 ± 542.00 | 11594.35 ± 542.00 | 11707.59 ± 542.00 |
| MiniMax-M2.5-REAP-172B-A10B | tg32 @ d65535 | 6.51 ± 0.03 | 7.33 ± 0.47 | | | |
Server: llama.cpp version 8123 (f75c4e8bf), built with GNU 13.3.0:

```bash
llama.cpp/build/bin/llama-server \
  --host 0.0.0.0 \
  --port 8001 \
  --model ~/models/MiniMax-M2.5-REAP-172B-A10B.i1-IQ4_XS.gguf \
  --alias openai/mradermacher/MiniMax-M2.5-REAP-172B-A10B \
  --no-mmap \
  --flash-attn on \
  --n-gpu-layers 999 \
  --ctx-size 100000 \
  --chat-template-file ~/llama.cpp/models/templates/MiniMax-M2.jinja
```
Comparison table vs your NVFP4 run:
| test | NVFP4 t/s | our t/s | delta abs | delta % |
|---|---|---|---|---|
| pp2048 | 3342.54 | 334.44 | -3008.10 | -90.0% |
| tg32 | 16.71 | 20.36 | +3.65 | +21.8% |
| ctx_pp @ d4096 | 2994.70 | 463.87 | -2530.83 | -84.5% |
| ctx_tg @ d4096 | 16.49 | 18.24 | +1.75 | +10.6% |
| pp2048 @ d4096 | 2383.55 | 454.65 | -1928.90 | -80.9% |
| tg32 @ d4096 | 16.27 | 17.67 | +1.40 | +8.6% |
| ctx_pp @ d8192 | 2554.64 | 337.76 | -2216.88 | -86.8% |
| ctx_tg @ d8192 | 15.85 | 13.12 | -2.73 | -17.2% |
| pp2048 @ d8192 | 1929.08 | 347.91 | -1581.17 | -82.0% |
| tg32 @ d8192 | 15.66 | 13.22 | -2.44 | -15.6% |
| ctx_pp @ d16384 | 2073.85 | 263.29 | -1810.56 | -87.3% |
| ctx_tg @ d16384 | 14.55 | 9.61 | -4.94 | -34.0% |
| pp2048 @ d16384 | 1463.58 | 258.94 | -1204.64 | -82.3% |
| tg32 @ d16384 | 14.30 | 9.28 | -5.02 | -35.1% |
| ctx_pp @ d32768 | 1519.62 | 191.41 | -1328.21 | -87.4% |
| ctx_tg @ d32768 | 12.95 | 6.66 | -6.29 | -48.6% |
| pp2048 @ d32768 | 953.78 | 184.07 | -769.71 | -80.7% |
| tg32 @ d32768 | 12.84 | 6.76 | -6.08 | -47.4% |
| ctx_pp @ d65535 | 1000.55 | 182.88 | -817.67 | -81.7% |
| ctx_tg @ d65535 | 10.49 | 6.58 | -3.91 | -37.3% |
| pp2048 @ d65535 | 571.21 | 177.03 | -394.18 | -69.0% |
| tg32 @ d65535 | 10.38 | 6.51 | -3.87 | -37.3% |
Takeaways:
- My run (llama.cpp + GGUF i1-IQ4_XS) is much slower on prefill than the NVFP4 + vLLM run: roughly -69% to -90% on pp2048 / ctx_pp.
- Decode at short depth is good: tg32 and tg32 @ d4096 are actually higher than NVFP4 (+22%, +9%).
- As context depth increases, GGUF decode drops below NVFP4:
  - around -17% at d8192
  - around -34% to -49% from d16384 to d32768
  - about -37% at d65535
Then comparing ttfr (for depth > 0, combined = ctx_pp ttfr + pp2048 ttfr):
| depth | NVFP4 pp2048 ttfr (ms) | GGUF pp2048 ttfr (ms) | NVFP4 ctx_pp ttfr (ms) | GGUF ctx_pp ttfr (ms) | NVFP4 combined (ms) | GGUF combined (ms) | combined slowdown |
|---|---|---|---|---|---|---|---|
| 0 | 720.56 | 7516.62 | - | - | 720.56 | 7516.62 | 10.43x |
| 4096 | 966.03 | 4632.24 | 1474.47 | 9048.49 | 2440.50 | 13680.73 | 5.61x |
| 8192 | 1168.69 | 6012.66 | 3313.43 | 24530.93 | 4482.12 | 30543.59 | 6.81x |
| 16384 | 1506.03 | 8058.99 | 8006.99 | 62834.47 | 9513.02 | 70893.46 | 7.45x |
| 32768 | 2253.96 | 11253.80 | 21669.84 | 172493.19 | 23923.80 | 183746.99 | 7.68x |
| 65535 | 3692.10 | 11707.56 | 65605.61 | 358595.32 | 69297.71 | 370302.88 | 5.34x |
So generally 5-8x slower across long contexts.
entrpi
February 27, 2026, 8:45am
9
Update: changed the settings as below and retested (adjusted invocation sketched after the list):
- --ctx-size: 100000 → 80000
- --parallel: auto/4 → 1
- --cache-ram: default enabled (8192 MiB) → 0 (disabled)
- n_slots (effective): 4 → 1
- kv_unified: true → false (because parallel=1)
- KV cache allocation: ~24242 MiB → ~19406 MiB
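For concreteness, the adjusted invocation would look roughly like this (a sketch: the command from my previous post with the changes above applied; verify the flag names against your llama.cpp build):

```bash
# Sketch: previous llama-server command with the updated settings applied.
llama.cpp/build/bin/llama-server \
  --host 0.0.0.0 \
  --port 8001 \
  --model ~/models/MiniMax-M2.5-REAP-172B-A10B.i1-IQ4_XS.gguf \
  --alias openai/mradermacher/MiniMax-M2.5-REAP-172B-A10B \
  --no-mmap \
  --flash-attn on \
  --n-gpu-layers 999 \
  --ctx-size 80000 \
  --parallel 1 \
  --cache-ram 0 \
  --chat-template-file ~/llama.cpp/models/templates/MiniMax-M2.jinja
```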
| test | NVFP4 | old_ctx100k_p4 | new_ctx80k_p1 | new vs NVFP4 | new vs old |
|---|---|---|---|---|---|
| pp2048 | 3342.54 | 334.44 | 642.37 | -80.8% | +92.1% |
| tg32 | 16.71 | 20.36 | 26.15 | +56.5% | +28.5% |
| ctx_pp @ d4096 | 2994.70 | 463.87 | 640.05 | -78.6% | +38.0% |
| ctx_tg @ d4096 | 16.49 | 18.24 | 25.16 | +52.6% | +38.0% |
| pp2048 @ d4096 | 2383.55 | 454.65 | 589.84 | -75.3% | +29.7% |
| tg32 @ d4096 | 16.27 | 17.67 | 24.23 | +48.9% | +37.1% |
| ctx_pp @ d8192 | 2554.64 | 337.76 | 604.70 | -76.3% | +79.0% |
| ctx_tg @ d8192 | 15.85 | 13.12 | 22.33 | +40.9% | +70.1% |
| pp2048 @ d8192 | 1929.08 | 347.91 | 514.93 | -73.3% | +48.0% |
| tg32 @ d8192 | 15.66 | 13.22 | 19.68 | +25.7% | +48.9% |
| ctx_pp @ d16384 | 2073.85 | 263.29 | 540.87 | -73.9% | +105.4% |
| ctx_tg @ d16384 | 14.55 | 9.61 | 17.72 | +21.8% | +84.4% |
| pp2048 @ d16384 | 1463.58 | 258.94 | 437.18 | -70.1% | +68.8% |
| tg32 @ d16384 | 14.30 | 9.28 | 16.85 | +17.8% | +81.5% |
| ctx_pp @ d32768 | 1519.62 | 191.41 | 455.06 | -70.1% | +137.7% |
| ctx_tg @ d32768 | 12.95 | 6.66 | 13.26 | +2.4% | +99.2% |
| pp2048 @ d32768 | 953.78 | 184.07 | 333.40 | -65.0% | +81.1% |
| tg32 @ d32768 | 12.84 | 6.76 | 12.94 | +0.8% | +91.5% |
| ctx_pp @ d65535 | 1000.55 | 182.88 | 347.37 | -65.3% | +89.9% |
| ctx_tg @ d65535 | 10.49 | 6.58 | 8.68 | -17.2% | +32.0% |
| pp2048 @ d65535 | 571.21 | 177.03 | 227.56 | -60.2% | +28.5% |
| tg32 @ d65535 | 10.38 | 6.51 | 8.51 | -18.0% | +30.8% |
New results vs NVFP4, throughput comparison (t/s):

| test | NVFP4 t/s | new t/s | delta abs | delta % |
|---|---|---|---|---|
| pp2048 | 3342.54 | 642.37 | -2700.17 | -80.8% |
| tg32 | 16.71 | 26.15 | +9.44 | +56.5% |
| ctx_pp @ d4096 | 2994.70 | 640.05 | -2354.65 | -78.6% |
| ctx_tg @ d4096 | 16.49 | 25.16 | +8.67 | +52.6% |
| pp2048 @ d4096 | 2383.55 | 589.84 | -1793.71 | -75.3% |
| tg32 @ d4096 | 16.27 | 24.23 | +7.96 | +48.9% |
| ctx_pp @ d8192 | 2554.64 | 604.70 | -1949.94 | -76.3% |
| ctx_tg @ d8192 | 15.85 | 22.33 | +6.48 | +40.9% |
| pp2048 @ d8192 | 1929.08 | 514.93 | -1414.15 | -73.3% |
| tg32 @ d8192 | 15.66 | 19.68 | +4.02 | +25.7% |
| ctx_pp @ d16384 | 2073.85 | 540.87 | -1532.98 | -73.9% |
| ctx_tg @ d16384 | 14.55 | 17.72 | +3.17 | +21.8% |
| pp2048 @ d16384 | 1463.58 | 437.18 | -1026.40 | -70.1% |
| tg32 @ d16384 | 14.30 | 16.85 | +2.55 | +17.8% |
| ctx_pp @ d32768 | 1519.62 | 455.06 | -1064.56 | -70.1% |
| ctx_tg @ d32768 | 12.95 | 13.26 | +0.31 | +2.4% |
| pp2048 @ d32768 | 953.78 | 333.40 | -620.38 | -65.0% |
| tg32 @ d32768 | 12.84 | 12.94 | +0.10 | +0.8% |
| ctx_pp @ d65535 | 1000.55 | 347.37 | -653.18 | -65.3% |
| ctx_tg @ d65535 | 10.49 | 8.68 | -1.81 | -17.2% |
| pp2048 @ d65535 | 571.21 | 227.56 | -343.65 | -60.2% |
| tg32 @ d65535 | 10.38 | 8.51 | -1.87 | -18.0% |
Now tg is faster with GGUF, except at the longest contexts.
Runtime/Latency Comparison (ttfr-based)
For depth > 0, combined = ctx_pp ttfr + pp2048 ttfr.
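For example, at d4096 the GGUF combined latency is 6488.55 + 3561.20 ≈ 10049.76 ms, versus 1474.47 + 966.03 = 2440.50 ms for NVFP4, i.e. a 10049.76 / 2440.50 ≈ 4.12x slowdown.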
| depth | NVFP4 pp2048 ttfr (ms) | GGUF pp2048 ttfr (ms) | NVFP4 ctx_pp ttfr (ms) | GGUF ctx_pp ttfr (ms) | NVFP4 combined (ms) | GGUF combined (ms) | slowdown |
|---|---|---|---|---|---|---|---|
| 0 | 720.56 | 3278.15 | - | - | 720.56 | 3278.15 | 4.55x |
| 4096 | 966.03 | 3561.20 | 1474.47 | 6488.55 | 2440.50 | 10049.76 | 4.12x |
| 8192 | 1168.69 | 4066.32 | 3313.43 | 13636.33 | 4482.12 | 17702.64 | 3.95x |
| 16384 | 1506.03 | 4773.60 | 8006.99 | 30381.22 | 9513.02 | 35154.82 | 3.70x |
| 32768 | 2253.96 | 6231.85 | 21669.84 | 72097.46 | 23923.80 | 78329.31 | 3.27x |
| 65535 | 3692.10 | 9089.09 | 65605.61 | 188746.40 | 69297.71 | 197835.48 | 2.85x |
So only about 3 to 5x slower with better settings.
Marsy
March 1, 2026, 11:11pm
11
That's mostly my experience using REAP models. I was a little disappointed that there were only benchmarks.
Someone did exactly what I hoped they would: a GB10-targeted NVFP4 quant of the larger REAP I mentioned above.
I'm going to grab this and see if eugr's build with Marlin and the needed variables works too; it would be a good option to compare with the supposedly forthcoming "Atlas engine" as well.