MiniMax 2.5 REAP - NVFP4 on single DGX Spark

Yesterday a REAP version of MiniMax 2.5 showed up, already quantised to NVFP4.

I ran benchy on it:

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| MiniMax-M2.5 | pp2048 | 3342.54 ± 141.85 | | 720.56 ± 26.78 | 613.84 ± 26.78 | 720.64 ± 26.81 |
| MiniMax-M2.5 | tg32 | 16.71 ± 0.24 | 17.00 ± 0.00 | | | |
| MiniMax-M2.5 | ctx_pp @ d4096 | 2994.70 ± 4.09 | | 1474.47 ± 1.87 | 1367.75 ± 1.87 | 1474.53 ± 1.86 |
| MiniMax-M2.5 | ctx_tg @ d4096 | 16.49 ± 0.03 | 17.00 ± 0.00 | | | |
| MiniMax-M2.5 | pp2048 @ d4096 | 2383.55 ± 23.95 | | 966.03 ± 8.69 | 859.31 ± 8.69 | 966.08 ± 8.70 |
| MiniMax-M2.5 | tg32 @ d4096 | 16.27 ± 0.03 | 17.00 ± 0.00 | | | |
| MiniMax-M2.5 | ctx_pp @ d8192 | 2554.64 ± 3.07 | | 3313.43 ± 3.86 | 3206.72 ± 3.86 | 3313.50 ± 3.86 |
| MiniMax-M2.5 | ctx_tg @ d8192 | 15.85 ± 0.02 | 16.33 ± 0.47 | | | |
| MiniMax-M2.5 | pp2048 @ d8192 | 1929.08 ± 34.21 | | 1168.69 ± 18.78 | 1061.98 ± 18.78 | 1168.77 ± 18.78 |
| MiniMax-M2.5 | tg32 @ d8192 | 15.66 ± 0.02 | 16.00 ± 0.00 | | | |
| MiniMax-M2.5 | ctx_pp @ d16384 | 2073.85 ± 1.07 | | 8006.99 ± 4.06 | 7900.28 ± 4.06 | 8007.06 ± 4.06 |
| MiniMax-M2.5 | ctx_tg @ d16384 | 14.55 ± 0.26 | 15.33 ± 0.47 | | | |
| MiniMax-M2.5 | pp2048 @ d16384 | 1463.58 ± 2.90 | | 1506.03 ± 2.77 | 1399.32 ± 2.77 | 1506.10 ± 2.78 |
| MiniMax-M2.5 | tg32 @ d16384 | 14.30 ± 0.20 | 15.00 ± 0.00 | | | |
| MiniMax-M2.5 | ctx_pp @ d32768 | 1519.62 ± 0.70 | | 21669.84 ± 9.96 | 21563.12 ± 9.96 | 21669.91 ± 9.96 |
| MiniMax-M2.5 | ctx_tg @ d32768 | 12.95 ± 0.02 | 13.33 ± 0.47 | | | |
| MiniMax-M2.5 | pp2048 @ d32768 | 953.78 ± 0.49 | | 2253.96 ± 1.10 | 2147.24 ± 1.10 | 2254.04 ± 1.10 |
| MiniMax-M2.5 | tg32 @ d32768 | 12.84 ± 0.02 | 13.00 ± 0.00 | | | |
| MiniMax-M2.5 | ctx_pp @ d65535 | 1000.55 ± 0.63 | | 65605.61 ± 41.25 | 65498.89 ± 41.25 | 65605.67 ± 41.25 |
| MiniMax-M2.5 | ctx_tg @ d65535 | 10.49 ± 0.01 | 11.00 ± 0.00 | | | |
| MiniMax-M2.5 | pp2048 @ d65535 | 571.21 ± 0.27 | | 3692.10 ± 1.68 | 3585.38 ± 1.68 | 3692.19 ± 1.68 |
| MiniMax-M2.5 | tg32 @ d65535 | 10.38 ± 0.02 | 11.00 ± 0.00 | | | |

I had to change the provided vLLM command slightly:

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0
export SAFETENSORS_FAST_GPU=1
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export NCCL_IB_DISABLE=1
export OMP_NUM_THREADS=8

python3 -m vllm.entrypoints.openai.api_server \
    --model lukealonso/MiniMax-M2.5-REAP-139B-A10B-NVFP4 \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name MiniMax-M2.5 \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 64 \
    --max-model-len 131072 \
    --disable-custom-all-reduce \
    --attention-config.use_trtllm_attention=0 \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think
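
Once the server is up, a quick smoke test against the OpenAI-compatible endpoint can look like this (a sketch; it assumes the `--port` and `--served-model-name` values from the command above and a running server):

```shell
# minimal chat completion request against the local vLLM server
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MiniMax-M2.5",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64
      }'
```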

I didn't really have any time the past few weeks, so this is still running on scitrera/dgx-spark-vllm:0.14.0-t5; not sure if recent vLLM versions already fix some of the speed problems. Will test more tomorrow. I would guess 30 t/s, and 20 t/s at long context, should be possible?
Also, the customizations the M2 arch brings look like they could benefit from additional fused kernels. I also want to look at the NVFP4 speedup post and see what that might bring on top.

Have you tried minimax_m2 as the reasoning parser instead of minimax_m2_append_think?

I have to admit that I didn't check any outputs so far. Had to go to a wedding, and will run some tests tomorrow. Happy about any input.

I’m just curious. Lobotomy successful, patient brain-dead? What orientation did the lobo-set have? (short for: lobotomizing dataset)

Generally speaking, it’s a good sign if it can still make coffee afterwards… ;)

I tried this one last night as well as a GLM4.7-Flash MTP NVFP4. The output was gibberish, but it’s probably a skill issue on my end. Maybe I didn’t completely use the correct parameters 😅

This one is a little too strongly REAPed. The creator intended it to work on an RTX PRO 6000, so the target was a 96 GB memory footprint; removing 40% is somewhat too much.

A 20-25% REAP would be better for Spark.

Edit: NVFP4 or AWQ of this one would be a great target: cerebras/MiniMax-M2.5-REAP-172B-A10B · Hugging Face

A version of the 139B has landed as AWQ thanks to cyanwiki / captonn.

I ran llama-benchy against the i1-Q4_K_S from https://hf.tst.eu/model#MiniMax-M2.5-REAP-139B-A10B-GGUF

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| MiniMax-M2.5-REAP-172B-A10B | pp2048 | 334.44 ± 160.18 | | 7516.62 ± 2683.80 | 7403.41 ± 2683.80 | 7517.01 ± 2684.01 |
| MiniMax-M2.5-REAP-172B-A10B | tg32 | 20.36 ± 5.99 | 22.33 ± 5.91 | | | |
| MiniMax-M2.5-REAP-172B-A10B | ctx_pp @ d4096 | 463.87 ± 51.61 | | 9048.49 ± 948.20 | 8935.28 ± 948.20 | 9048.52 ± 948.20 |
| MiniMax-M2.5-REAP-172B-A10B | ctx_tg @ d4096 | 18.24 ± 1.87 | 19.67 ± 2.05 | | | |
| MiniMax-M2.5-REAP-172B-A10B | pp2048 @ d4096 | 454.65 ± 25.72 | | 4632.24 ± 255.56 | 4519.03 ± 255.56 | 4632.28 ± 255.56 |
| MiniMax-M2.5-REAP-172B-A10B | tg32 @ d4096 | 17.67 ± 1.39 | 18.67 ± 1.70 | | | |
| MiniMax-M2.5-REAP-172B-A10B | ctx_pp @ d8192 | 337.76 ± 28.10 | | 24530.93 ± 1967.98 | 24417.72 ± 1967.98 | 24530.98 ± 1967.99 |
| MiniMax-M2.5-REAP-172B-A10B | ctx_tg @ d8192 | 13.12 ± 0.76 | 15.00 ± 0.82 | | | |
| MiniMax-M2.5-REAP-172B-A10B | pp2048 @ d8192 | 347.91 ± 16.29 | | 6012.66 ± 274.07 | 5899.45 ± 274.07 | 6012.69 ± 274.06 |
| MiniMax-M2.5-REAP-172B-A10B | tg32 @ d8192 | 13.22 ± 0.75 | 14.00 ± 0.82 | | | |
| MiniMax-M2.5-REAP-172B-A10B | ctx_pp @ d16384 | 263.29 ± 23.51 | | 62834.47 ± 5526.55 | 62721.26 ± 5526.55 | 62834.53 ± 5526.57 |
| MiniMax-M2.5-REAP-172B-A10B | ctx_tg @ d16384 | 9.61 ± 0.63 | 10.33 ± 0.47 | | | |
| MiniMax-M2.5-REAP-172B-A10B | pp2048 @ d16384 | 258.94 ± 17.57 | | 8058.99 ± 539.89 | 7945.78 ± 539.89 | 8059.04 ± 539.90 |
| MiniMax-M2.5-REAP-172B-A10B | tg32 @ d16384 | 9.28 ± 1.00 | 10.00 ± 0.82 | | | |
| MiniMax-M2.5-REAP-172B-A10B | ctx_pp @ d32768 | 191.41 ± 16.09 | | 172493.19 ± 14154.36 | 172379.98 ± 14154.36 | 172493.91 ± 14155.10 |
| MiniMax-M2.5-REAP-172B-A10B | ctx_tg @ d32768 | 6.66 ± 0.16 | 7.67 ± 0.47 | | | |
| MiniMax-M2.5-REAP-172B-A10B | pp2048 @ d32768 | 184.07 ± 6.47 | | 11253.80 ± 401.75 | 11140.59 ± 401.75 | 11253.86 ± 401.80 |
| MiniMax-M2.5-REAP-172B-A10B | tg32 @ d32768 | 6.76 ± 0.19 | 7.67 ± 0.47 | | | |
| MiniMax-M2.5-REAP-172B-A10B | ctx_pp @ d65535 | 182.88 ± 3.51 | | 358595.32 ± 6962.06 | 358482.10 ± 6962.06 | 358596.26 ± 6963.34 |
| MiniMax-M2.5-REAP-172B-A10B | ctx_tg @ d65535 | 6.58 ± 0.29 | 7.33 ± 0.47 | | | |
| MiniMax-M2.5-REAP-172B-A10B | pp2048 @ d65535 | 177.03 ± 8.37 | | 11707.56 ± 542.00 | 11594.35 ± 542.00 | 11707.59 ± 542.00 |
| MiniMax-M2.5-REAP-172B-A10B | tg32 @ d65535 | 6.51 ± 0.03 | 7.33 ± 0.47 | | | |

Server: llama.cpp version 8123 (f75c4e8bf), built with GCC 13.3.0:

llama.cpp/build/bin/llama-server \
    --host 0.0.0.0 \
    --port 8001 \
    --model ~/models/MiniMax-M2.5-REAP-172B-A10B.i1-IQ4_XS.gguf \
    --alias openai/mradermacher/MiniMax-M2.5-REAP-172B-A10B \
    --no-mmap \
    --flash-attn on \
    --n-gpu-layers 999 \
    --ctx-size 100000 \
    --chat-template-file ~/llama.cpp/models/templates/MiniMax-M2.jinja
Comparison table vs your NVFP4 run:

| test | NVFP4 t/s | our t/s | delta abs | delta % |
|---|---|---|---|---|
| pp2048 | 3342.54 | 334.44 | -3008.10 | -90.0% |
| tg32 | 16.71 | 20.36 | +3.65 | +21.8% |
| ctx_pp @ d4096 | 2994.70 | 463.87 | -2530.83 | -84.5% |
| ctx_tg @ d4096 | 16.49 | 18.24 | +1.75 | +10.6% |
| pp2048 @ d4096 | 2383.55 | 454.65 | -1928.90 | -80.9% |
| tg32 @ d4096 | 16.27 | 17.67 | +1.40 | +8.6% |
| ctx_pp @ d8192 | 2554.64 | 337.76 | -2216.88 | -86.8% |
| ctx_tg @ d8192 | 15.85 | 13.12 | -2.73 | -17.2% |
| pp2048 @ d8192 | 1929.08 | 347.91 | -1581.17 | -82.0% |
| tg32 @ d8192 | 15.66 | 13.22 | -2.44 | -15.6% |
| ctx_pp @ d16384 | 2073.85 | 263.29 | -1810.56 | -87.3% |
| ctx_tg @ d16384 | 14.55 | 9.61 | -4.94 | -34.0% |
| pp2048 @ d16384 | 1463.58 | 258.94 | -1204.64 | -82.3% |
| tg32 @ d16384 | 14.30 | 9.28 | -5.02 | -35.1% |
| ctx_pp @ d32768 | 1519.62 | 191.41 | -1328.21 | -87.4% |
| ctx_tg @ d32768 | 12.95 | 6.66 | -6.29 | -48.6% |
| pp2048 @ d32768 | 953.78 | 184.07 | -769.71 | -80.7% |
| tg32 @ d32768 | 12.84 | 6.76 | -6.08 | -47.4% |
| ctx_pp @ d65535 | 1000.55 | 182.88 | -817.67 | -81.7% |
| ctx_tg @ d65535 | 10.49 | 6.58 | -3.91 | -37.3% |
| pp2048 @ d65535 | 571.21 | 177.03 | -394.18 | -69.0% |
| tg32 @ d65535 | 10.38 | 6.51 | -3.87 | -37.3% |

Takeaways:

  • My run (llama.cpp + GGUF i1-IQ4_XS) is much slower on prefill than the NVFP4+vLLM run: roughly -69% to -90% on pp2048 / ctx_pp.
  • Decode at short depth is good: tg32 and tg32 @ d4096 are actually higher than NVFP4 (+22%, +9%).
  • As context depth increases, GGUF decode drops below NVFP4:
    • around -17% at d8192
    • around -34% to -49% from d16384 to d32768
    • about -37% at d65535.
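
The delta % column is just (GGUF - NVFP4) / NVFP4; e.g. recomputing the pp2048 row from the table above:

```shell
# delta % for pp2048: (GGUF t/s - NVFP4 t/s) / NVFP4 t/s * 100
awk 'BEGIN { printf "%.1f%%\n", (334.44 - 3342.54) / 3342.54 * 100 }'
# prints -90.0%
```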

Then comparing ttfr:

| depth | NVFP4 pp2048 ttfr (ms) | GGUF pp2048 ttfr (ms) | NVFP4 ctx_pp ttfr (ms) | GGUF ctx_pp ttfr (ms) | NVFP4 combined (ms) | GGUF combined (ms) | combined slowdown |
|---|---|---|---|---|---|---|---|
| 0 | 720.56 | 7516.62 | - | - | 720.56 | 7516.62 | 10.43x |
| 4096 | 966.03 | 4632.24 | 1474.47 | 9048.49 | 2440.50 | 13680.73 | 5.61x |
| 8192 | 1168.69 | 6012.66 | 3313.43 | 24530.93 | 4482.12 | 30543.59 | 6.81x |
| 16384 | 1506.03 | 8058.99 | 8006.99 | 62834.47 | 9513.02 | 70893.46 | 7.45x |
| 32768 | 2253.96 | 11253.80 | 21669.84 | 172493.19 | 23923.80 | 183746.99 | 7.68x |
| 65535 | 3692.10 | 11707.56 | 65605.61 | 358595.32 | 69297.71 | 370302.88 | 5.34x |

So generally 5-8x slower across long contexts.
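
Sanity-checking the combined numbers (combined = ctx_pp ttfr + pp2048 ttfr at the same depth), e.g. for d65535:

```shell
# combined ttfr at d65535 for both runs, and the resulting slowdown factor
awk 'BEGIN {
  nvfp4 = 65605.61 + 3692.10     # NVFP4: ctx_pp ttfr + pp2048 ttfr (ms)
  gguf  = 358595.32 + 11707.56   # GGUF:  ctx_pp ttfr + pp2048 ttfr (ms)
  printf "%.2fx\n", gguf / nvfp4
}'
# prints 5.34x
```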

Update: changed settings as below and retested.

  • --ctx-size: 100000 → 80000
  • --parallel: auto/4 → 1
  • --cache-ram: default enabled (8192 MiB) → 0 (disabled)
  • n_slots (effective): 4 → 1
  • kv_unified: true → false (because parallel=1)
  • KV cache allocation: ~24242 MiB → ~19406 MiB
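
Concretely, the changed settings correspond to llama-server flags along these lines (a fragment, not the full command; everything else stays as in the command above, and `--cache-ram` needs a reasonably recent llama.cpp build):

```shell
# retest: smaller context, single slot, prompt cache in RAM disabled
llama.cpp/build/bin/llama-server \
    --ctx-size 80000 \
    --parallel 1 \
    --cache-ram 0 \
    ...                 # remaining flags unchanged from the command above
```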

| test | NVFP4 | old_ctx100k_p4 | new_ctx80k_p1 | new vs NVFP4 | new vs old |
|---|---|---|---|---|---|
| pp2048 | 3342.54 | 334.44 | 642.37 | -80.8% | +92.1% |
| tg32 | 16.71 | 20.36 | 26.15 | +56.5% | +28.5% |
| ctx_pp @ d4096 | 2994.70 | 463.87 | 640.05 | -78.6% | +38.0% |
| ctx_tg @ d4096 | 16.49 | 18.24 | 25.16 | +52.6% | +38.0% |
| pp2048 @ d4096 | 2383.55 | 454.65 | 589.84 | -75.3% | +29.7% |
| tg32 @ d4096 | 16.27 | 17.67 | 24.23 | +48.9% | +37.1% |
| ctx_pp @ d8192 | 2554.64 | 337.76 | 604.70 | -76.3% | +79.0% |
| ctx_tg @ d8192 | 15.85 | 13.12 | 22.33 | +40.9% | +70.1% |
| pp2048 @ d8192 | 1929.08 | 347.91 | 514.93 | -73.3% | +48.0% |
| tg32 @ d8192 | 15.66 | 13.22 | 19.68 | +25.7% | +48.9% |
| ctx_pp @ d16384 | 2073.85 | 263.29 | 540.87 | -73.9% | +105.4% |
| ctx_tg @ d16384 | 14.55 | 9.61 | 17.72 | +21.8% | +84.4% |
| pp2048 @ d16384 | 1463.58 | 258.94 | 437.18 | -70.1% | +68.8% |
| tg32 @ d16384 | 14.30 | 9.28 | 16.85 | +17.8% | +81.5% |
| ctx_pp @ d32768 | 1519.62 | 191.41 | 455.06 | -70.1% | +137.7% |
| ctx_tg @ d32768 | 12.95 | 6.66 | 13.26 | +2.4% | +99.2% |
| pp2048 @ d32768 | 953.78 | 184.07 | 333.40 | -65.0% | +81.1% |
| tg32 @ d32768 | 12.84 | 6.76 | 12.94 | +0.8% | +91.5% |
| ctx_pp @ d65535 | 1000.55 | 182.88 | 347.37 | -65.3% | +89.9% |
| ctx_tg @ d65535 | 10.49 | 6.58 | 8.68 | -17.2% | +32.0% |
| pp2048 @ d65535 | 571.21 | 177.03 | 227.56 | -60.2% | +28.5% |
| tg32 @ d65535 | 10.38 | 6.51 | 8.51 | -18.0% | +30.8% |

New results vs NVFP4:

Throughput Comparison (t/s)

| test | NVFP4 t/s | new t/s | delta abs | delta % |
|---|---|---|---|---|
| pp2048 | 3342.54 | 642.37 | -2700.17 | -80.8% |
| tg32 | 16.71 | 26.15 | +9.44 | +56.5% |
| ctx_pp @ d4096 | 2994.70 | 640.05 | -2354.65 | -78.6% |
| ctx_tg @ d4096 | 16.49 | 25.16 | +8.67 | +52.6% |
| pp2048 @ d4096 | 2383.55 | 589.84 | -1793.71 | -75.3% |
| tg32 @ d4096 | 16.27 | 24.23 | +7.96 | +48.9% |
| ctx_pp @ d8192 | 2554.64 | 604.70 | -1949.94 | -76.3% |
| ctx_tg @ d8192 | 15.85 | 22.33 | +6.48 | +40.9% |
| pp2048 @ d8192 | 1929.08 | 514.93 | -1414.15 | -73.3% |
| tg32 @ d8192 | 15.66 | 19.68 | +4.02 | +25.7% |
| ctx_pp @ d16384 | 2073.85 | 540.87 | -1532.98 | -73.9% |
| ctx_tg @ d16384 | 14.55 | 17.72 | +3.17 | +21.8% |
| pp2048 @ d16384 | 1463.58 | 437.18 | -1026.40 | -70.1% |
| tg32 @ d16384 | 14.30 | 16.85 | +2.55 | +17.8% |
| ctx_pp @ d32768 | 1519.62 | 455.06 | -1064.56 | -70.1% |
| ctx_tg @ d32768 | 12.95 | 13.26 | +0.31 | +2.4% |
| pp2048 @ d32768 | 953.78 | 333.40 | -620.38 | -65.0% |
| tg32 @ d32768 | 12.84 | 12.94 | +0.10 | +0.8% |
| ctx_pp @ d65535 | 1000.55 | 347.37 | -653.18 | -65.3% |
| ctx_tg @ d65535 | 10.49 | 8.68 | -1.81 | -17.2% |
| pp2048 @ d65535 | 571.21 | 227.56 | -343.65 | -60.2% |
| tg32 @ d65535 | 10.38 | 8.51 | -1.87 | -18.0% |

Now, tg is faster in GGUF, except at longest contexts.

Runtime/Latency Comparison (ttfr-based)
For depth > 0, combined = ctx_pp ttfr + pp2048 ttfr.

| depth | NVFP4 pp2048 ttfr (ms) | GGUF pp2048 ttfr (ms) | NVFP4 ctx_pp ttfr (ms) | GGUF ctx_pp ttfr (ms) | NVFP4 combined (ms) | GGUF combined (ms) | slowdown |
|---|---|---|---|---|---|---|---|
| 0 | 720.56 | 3278.15 | - | - | 720.56 | 3278.15 | 4.55x |
| 4096 | 966.03 | 3561.20 | 1474.47 | 6488.55 | 2440.50 | 10049.76 | 4.12x |
| 8192 | 1168.69 | 4066.32 | 3313.43 | 13636.33 | 4482.12 | 17702.64 | 3.95x |
| 16384 | 1506.03 | 4773.60 | 8006.99 | 30381.22 | 9513.02 | 35154.82 | 3.70x |
| 32768 | 2253.96 | 6231.85 | 21669.84 | 72097.46 | 23923.80 | 78329.31 | 3.27x |
| 65535 | 3692.10 | 9089.09 | 65605.61 | 188746.40 | 69297.71 | 197835.48 | 2.85x |

So only about 3 to 5x slower with better settings.

That’s mostly my experience using REAP models. I was a little disappointed that there were only benchmarks.

Someone did exactly as I hoped that they would - a GB10 board targeted NVFP4 quant of the larger REAP I mentioned above.

I’m going to grab this and see if eugr’s build with Marlin and the needed variables works too. It would also be a good option to compare with the supposedly forthcoming “Atlas engine”.