MiniMax 2.5 REAP - NVFP4 on single DGX Spark

Yesterday a REAP version of MiniMax 2.5 showed up, already quantised to NVFP4.

I ran llama-benchy on it:

model test t/s peak t/s ttfr (ms) est_ppt (ms) e2e_ttft (ms)
MiniMax-M2.5 pp2048 3342.54 ± 141.85 720.56 ± 26.78 613.84 ± 26.78 720.64 ± 26.81
MiniMax-M2.5 tg32 16.71 ± 0.24 17.00 ± 0.00
MiniMax-M2.5 ctx_pp @ d4096 2994.70 ± 4.09 1474.47 ± 1.87 1367.75 ± 1.87 1474.53 ± 1.86
MiniMax-M2.5 ctx_tg @ d4096 16.49 ± 0.03 17.00 ± 0.00
MiniMax-M2.5 pp2048 @ d4096 2383.55 ± 23.95 966.03 ± 8.69 859.31 ± 8.69 966.08 ± 8.70
MiniMax-M2.5 tg32 @ d4096 16.27 ± 0.03 17.00 ± 0.00
MiniMax-M2.5 ctx_pp @ d8192 2554.64 ± 3.07 3313.43 ± 3.86 3206.72 ± 3.86 3313.50 ± 3.86
MiniMax-M2.5 ctx_tg @ d8192 15.85 ± 0.02 16.33 ± 0.47
MiniMax-M2.5 pp2048 @ d8192 1929.08 ± 34.21 1168.69 ± 18.78 1061.98 ± 18.78 1168.77 ± 18.78
MiniMax-M2.5 tg32 @ d8192 15.66 ± 0.02 16.00 ± 0.00
MiniMax-M2.5 ctx_pp @ d16384 2073.85 ± 1.07 8006.99 ± 4.06 7900.28 ± 4.06 8007.06 ± 4.06
MiniMax-M2.5 ctx_tg @ d16384 14.55 ± 0.26 15.33 ± 0.47
MiniMax-M2.5 pp2048 @ d16384 1463.58 ± 2.90 1506.03 ± 2.77 1399.32 ± 2.77 1506.10 ± 2.78
MiniMax-M2.5 tg32 @ d16384 14.30 ± 0.20 15.00 ± 0.00
MiniMax-M2.5 ctx_pp @ d32768 1519.62 ± 0.70 21669.84 ± 9.96 21563.12 ± 9.96 21669.91 ± 9.96
MiniMax-M2.5 ctx_tg @ d32768 12.95 ± 0.02 13.33 ± 0.47
MiniMax-M2.5 pp2048 @ d32768 953.78 ± 0.49 2253.96 ± 1.10 2147.24 ± 1.10 2254.04 ± 1.10
MiniMax-M2.5 tg32 @ d32768 12.84 ± 0.02 13.00 ± 0.00
MiniMax-M2.5 ctx_pp @ d65535 1000.55 ± 0.63 65605.61 ± 41.25 65498.89 ± 41.25 65605.67 ± 41.25
MiniMax-M2.5 ctx_tg @ d65535 10.49 ± 0.01 11.00 ± 0.00
MiniMax-M2.5 pp2048 @ d65535 571.21 ± 0.27 3692.10 ± 1.68 3585.38 ± 1.68 3692.19 ± 1.68
MiniMax-M2.5 tg32 @ d65535 10.38 ± 0.02 11.00 ± 0.00
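For reference, the columns hang together arithmetically: the prefill t/s is just tokens divided by est_ppt, and ttfr carries a near-constant ~107 ms gap over est_ppt that looks like fixed per-request overhead. A quick sanity check against the pp2048 row (a sketch, using the numbers from the table above):

```python
# Sanity-check the pp2048 row: 2048 prompt tokens processed in est_ppt ms
# should reproduce the reported prefill throughput (~3342 t/s).
tokens = 2048
est_ppt_ms = 613.84          # from the pp2048 row
ttfr_ms = 720.56             # time to first response, same row

throughput = tokens / (est_ppt_ms / 1000)   # tokens per second
overhead_ms = ttfr_ms - est_ppt_ms          # fixed per-request overhead

print(f"{throughput:.1f} t/s, {overhead_ms:.2f} ms overhead")
```

The computed throughput lands within run-to-run noise of the reported mean, and the same ~106.7 ms gap shows up in every ctx_pp row.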

I had to change the provided vLLM command slightly:

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0
export SAFETENSORS_FAST_GPU=1
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export NCCL_IB_DISABLE=1
export OMP_NUM_THREADS=8

python3 -m vllm.entrypoints.openai.api_server \
    --model lukealonso/MiniMax-M2.5-REAP-139B-A10B-NVFP4 \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name MiniMax-M2.5 \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 64 \
    --max-model-len 131072 \
    --disable-custom-all-reduce \
    --attention-config.use_trtllm_attention=0 \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think
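Once it's up, a quick smoke test against the OpenAI-compatible endpoint (a minimal sketch: the model name has to match the --served-model-name above, and /v1/chat/completions is vLLM's standard route):

```python
import json

# Build an OpenAI-style chat completion request for the server above.
# POST it to http://<spark-ip>:8000/v1/chat/completions
payload = {
    "model": "MiniMax-M2.5",        # must match --served-model-name
    "messages": [{"role": "user", "content": "Say hi in one word."}],
    "max_tokens": 64,
}

print(json.dumps(payload, indent=2))
```

e.g. save it to payload.json and run `curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d @payload.json`.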

I didn't really have any time the past few weeks, so this is still running on scitrera/dgx-spark-vllm:0.14.0-t5; not sure if recent vLLM versions already fix some of the speed problems. Will test more tomorrow. I would guess 30 t/s, and 20 t/s at long context, should be possible?
Also, the customizations the M2 arch brings look like they could benefit from additional fused kernels. I'll also want to look at the NVFP4 speedup post and see what that might bring on top.

Have you tried minimax_m2 as the reasoning parser instead of minimax_m2_append_think?

I have to admit that I didn't check any outputs so far. Had to go to a wedding; will run some tests tomorrow. Happy about any input.

I’m just curious. Lobotomy successful, patient brain-dead? What orientation did the lobo-set have? (short for: lobotomizing dataset)

Generally speaking, it’s a good sign if it can still make coffee afterwards… ;)

I tried this one last night, as well as a GLM4.7-Flash MTP NVFP4. The output was gibberish, but it's probably a skill issue on my end. Maybe I didn't use completely correct parameters 😅

This one is a little too strongly REAPed. The creator intended it to work on an RTX PRO 6000, so the target was 96 GB of RAM; removing 40% is somewhat too much.

A 20-25% REAP would be better for Spark.

Edit: an NVFP4 or AWQ quant of this one would be a great target: cerebras/MiniMax-M2.5-REAP-172B-A10B on Hugging Face

A version of the 139B has landed as AWQ thanks to cyanwiki / captonn.

I ran llama-benchy against the i1-Q4_K_S from https://hf.tst.eu/model#MiniMax-M2.5-REAP-139B-A10B-GGUF

model test t/s peak t/s ttfr (ms) est_ppt (ms) e2e_ttft (ms)
MiniMax-M2.5-REAP-172B-A10B pp2048 334.44 ± 160.18 7516.62 ± 2683.80 7403.41 ± 2683.80 7517.01 ± 2684.01
MiniMax-M2.5-REAP-172B-A10B tg32 20.36 ± 5.99 22.33 ± 5.91
MiniMax-M2.5-REAP-172B-A10B ctx_pp @ d4096 463.87 ± 51.61 9048.49 ± 948.20 8935.28 ± 948.20 9048.52 ± 948.20
MiniMax-M2.5-REAP-172B-A10B ctx_tg @ d4096 18.24 ± 1.87 19.67 ± 2.05
MiniMax-M2.5-REAP-172B-A10B pp2048 @ d4096 454.65 ± 25.72 4632.24 ± 255.56 4519.03 ± 255.56 4632.28 ± 255.56
MiniMax-M2.5-REAP-172B-A10B tg32 @ d4096 17.67 ± 1.39 18.67 ± 1.70
MiniMax-M2.5-REAP-172B-A10B ctx_pp @ d8192 337.76 ± 28.10 24530.93 ± 1967.98 24417.72 ± 1967.98 24530.98 ± 1967.99
MiniMax-M2.5-REAP-172B-A10B ctx_tg @ d8192 13.12 ± 0.76 15.00 ± 0.82
MiniMax-M2.5-REAP-172B-A10B pp2048 @ d8192 347.91 ± 16.29 6012.66 ± 274.07 5899.45 ± 274.07 6012.69 ± 274.06
MiniMax-M2.5-REAP-172B-A10B tg32 @ d8192 13.22 ± 0.75 14.00 ± 0.82
MiniMax-M2.5-REAP-172B-A10B ctx_pp @ d16384 263.29 ± 23.51 62834.47 ± 5526.55 62721.26 ± 5526.55 62834.53 ± 5526.57
MiniMax-M2.5-REAP-172B-A10B ctx_tg @ d16384 9.61 ± 0.63 10.33 ± 0.47
MiniMax-M2.5-REAP-172B-A10B pp2048 @ d16384 258.94 ± 17.57 8058.99 ± 539.89 7945.78 ± 539.89 8059.04 ± 539.90
MiniMax-M2.5-REAP-172B-A10B tg32 @ d16384 9.28 ± 1.00 10.00 ± 0.82
MiniMax-M2.5-REAP-172B-A10B ctx_pp @ d32768 191.41 ± 16.09 172493.19 ± 14154.36 172379.98 ± 14154.36 172493.91 ± 14155.10
MiniMax-M2.5-REAP-172B-A10B ctx_tg @ d32768 6.66 ± 0.16 7.67 ± 0.47
MiniMax-M2.5-REAP-172B-A10B pp2048 @ d32768 184.07 ± 6.47 11253.80 ± 401.75 11140.59 ± 401.75 11253.86 ± 401.80
MiniMax-M2.5-REAP-172B-A10B tg32 @ d32768 6.76 ± 0.19 7.67 ± 0.47
MiniMax-M2.5-REAP-172B-A10B ctx_pp @ d65535 182.88 ± 3.51 358595.32 ± 6962.06 358482.10 ± 6962.06 358596.26 ± 6963.34
MiniMax-M2.5-REAP-172B-A10B ctx_tg @ d65535 6.58 ± 0.29 7.33 ± 0.47
MiniMax-M2.5-REAP-172B-A10B pp2048 @ d65535 177.03 ± 8.37 11707.56 ± 542.00 11594.35 ± 542.00 11707.59 ± 542.00
MiniMax-M2.5-REAP-172B-A10B tg32 @ d65535 6.51 ± 0.03 7.33 ± 0.47

Server: llama.cpp version 8123 (f75c4e8bf), built with GCC 13.3.0:

llama.cpp/build/bin/llama-server \
    --host 0.0.0.0 \
    --port 8001 \
    --model ~/models/MiniMax-M2.5-REAP-172B-A10B.i1-IQ4_XS.gguf \
    --alias openai/mradermacher/MiniMax-M2.5-REAP-172B-A10B \
    --no-mmap \
    --flash-attn on \
    --n-gpu-layers 999 \
    --ctx-size 100000 \
    --chat-template-file ~/llama.cpp/models/templates/MiniMax-M2.jinja

Comparison table vs your NVFP4 run:

test NVFP4 t/s our t/s delta abs delta %
pp2048 3342.54 334.44 -3008.10 -90.0%
tg32 16.71 20.36 +3.65 +21.8%
ctx_pp @ d4096 2994.70 463.87 -2530.83 -84.5%
ctx_tg @ d4096 16.49 18.24 +1.75 +10.6%
pp2048 @ d4096 2383.55 454.65 -1928.90 -80.9%
tg32 @ d4096 16.27 17.67 +1.40 +8.6%
ctx_pp @ d8192 2554.64 337.76 -2216.88 -86.8%
ctx_tg @ d8192 15.85 13.12 -2.73 -17.2%
pp2048 @ d8192 1929.08 347.91 -1581.17 -82.0%
tg32 @ d8192 15.66 13.22 -2.44 -15.6%
ctx_pp @ d16384 2073.85 263.29 -1810.56 -87.3%
ctx_tg @ d16384 14.55 9.61 -4.94 -34.0%
pp2048 @ d16384 1463.58 258.94 -1204.64 -82.3%
tg32 @ d16384 14.30 9.28 -5.02 -35.1%
ctx_pp @ d32768 1519.62 191.41 -1328.21 -87.4%
ctx_tg @ d32768 12.95 6.66 -6.29 -48.6%
pp2048 @ d32768 953.78 184.07 -769.71 -80.7%
tg32 @ d32768 12.84 6.76 -6.08 -47.4%
ctx_pp @ d65535 1000.55 182.88 -817.67 -81.7%
ctx_tg @ d65535 10.49 6.58 -3.91 -37.3%
pp2048 @ d65535 571.21 177.03 -394.18 -69.0%
tg32 @ d65535 10.38 6.51 -3.87 -37.3%

Takeaways:

  • My run (llama.cpp + GGUF i1-IQ4_XS) is much slower on prefill than the NVFP4+vLLM run: roughly -69% to -90% on pp2048 / ctx_pp.
  • Decode at short depth is good: tg32 and tg32 @ d4096 are actually higher than NVFP4 (+22%, +9%).
  • As context depth increases, GGUF decode drops below NVFP4:
    • around -17% at d8192
    • around -34% to -49% from d16384 to d32768
    • about -37% at d65535.
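The delta columns in the comparison table are just the relative change of the GGUF run against NVFP4, i.e. (gguf / nvfp4 − 1) × 100:

```python
# Reproduce the "delta %" column: relative change of the GGUF run vs NVFP4.
def delta_pct(nvfp4: float, gguf: float) -> float:
    return (gguf / nvfp4 - 1) * 100

print(round(delta_pct(3342.54, 334.44), 1))   # pp2048  -> -90.0
print(round(delta_pct(16.71, 20.36), 1))      # tg32    -> 21.8
```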

Then comparing ttfr:

depth NVFP4 pp2048 ttfr (ms) GGUF pp2048 ttfr (ms) NVFP4 ctx_pp ttfr (ms) GGUF ctx_pp ttfr (ms) NVFP4 combined (ms) GGUF combined (ms) combined slowdown
0 720.56 7516.62 - - 720.56 7516.62 10.43x
4096 966.03 4632.24 1474.47 9048.49 2440.50 13680.73 5.61x
8192 1168.69 6012.66 3313.43 24530.93 4482.12 30543.59 6.81x
16384 1506.03 8058.99 8006.99 62834.47 9513.02 70893.46 7.45x
32768 2253.96 11253.80 21669.84 172493.19 23923.80 183746.99 7.68x
65535 3692.10 11707.56 65605.61 358595.32 69297.71 370302.88 5.34x

So generally 5-8x slower across long contexts.
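Combined here means ctx_pp ttfr + pp2048 ttfr (prefill the existing context, then the fresh 2048-token prompt), and the slowdown column is the GGUF/NVFP4 ratio. Worked through for the d4096 row:

```python
# Combined TTFR at depth d = time to prefill the existing context (ctx_pp)
# plus the time to prefill the new 2048-token prompt (pp2048).
nvfp4_combined = 1474.47 + 966.03    # d4096 row, NVFP4
gguf_combined = 9048.49 + 4632.24    # d4096 row, GGUF

slowdown = gguf_combined / nvfp4_combined
print(f"{nvfp4_combined:.2f} ms vs {gguf_combined:.2f} ms -> {slowdown:.2f}x")
```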

Update: changed settings as below and retested.

  • --ctx-size: 100000 → 80000
  • --parallel: auto/4 → 1
  • --cache-ram: default enabled (8192 MiB) → 0 (disabled)
  • n_slots (effective): 4 → 1
  • kv_unified: true → false (because parallel=1)
  • KV cache allocation: ~24242 MiB → ~19406 MiB

test NVFP4 old_ctx100k_p4 new_ctx80k_p1 new vs NVFP4 new vs old
pp2048 3342.54 334.44 642.37 -80.8% +92.1%
tg32 16.71 20.36 26.15 +56.5% +28.5%
ctx_pp @ d4096 2994.70 463.87 640.05 -78.6% +38.0%
ctx_tg @ d4096 16.49 18.24 25.16 +52.6% +38.0%
pp2048 @ d4096 2383.55 454.65 589.84 -75.3% +29.7%
tg32 @ d4096 16.27 17.67 24.23 +48.9% +37.1%
ctx_pp @ d8192 2554.64 337.76 604.70 -76.3% +79.0%
ctx_tg @ d8192 15.85 13.12 22.33 +40.9% +70.1%
pp2048 @ d8192 1929.08 347.91 514.93 -73.3% +48.0%
tg32 @ d8192 15.66 13.22 19.68 +25.7% +48.9%
ctx_pp @ d16384 2073.85 263.29 540.87 -73.9% +105.4%
ctx_tg @ d16384 14.55 9.61 17.72 +21.8% +84.4%
pp2048 @ d16384 1463.58 258.94 437.18 -70.1% +68.8%
tg32 @ d16384 14.30 9.28 16.85 +17.8% +81.5%
ctx_pp @ d32768 1519.62 191.41 455.06 -70.1% +137.7%
ctx_tg @ d32768 12.95 6.66 13.26 +2.4% +99.2%
pp2048 @ d32768 953.78 184.07 333.40 -65.0% +81.1%
tg32 @ d32768 12.84 6.76 12.94 +0.8% +91.5%
ctx_pp @ d65535 1000.55 182.88 347.37 -65.3% +89.9%
ctx_tg @ d65535 10.49 6.58 8.68 -17.2% +32.0%
pp2048 @ d65535 571.21 177.03 227.56 -60.2% +28.5%
tg32 @ d65535 10.38 6.51 8.51 -18.0% +30.8%

New results vs NVFP4:

Throughput Comparison (t/s)

test NVFP4 t/s new t/s delta abs delta %
pp2048 3342.54 642.37 -2700.17 -80.8%
tg32 16.71 26.15 +9.44 +56.5%
ctx_pp @ d4096 2994.70 640.05 -2354.65 -78.6%
ctx_tg @ d4096 16.49 25.16 +8.67 +52.6%
pp2048 @ d4096 2383.55 589.84 -1793.71 -75.3%
tg32 @ d4096 16.27 24.23 +7.96 +48.9%
ctx_pp @ d8192 2554.64 604.70 -1949.94 -76.3%
ctx_tg @ d8192 15.85 22.33 +6.48 +40.9%
pp2048 @ d8192 1929.08 514.93 -1414.15 -73.3%
tg32 @ d8192 15.66 19.68 +4.02 +25.7%
ctx_pp @ d16384 2073.85 540.87 -1532.98 -73.9%
ctx_tg @ d16384 14.55 17.72 +3.17 +21.8%
pp2048 @ d16384 1463.58 437.18 -1026.40 -70.1%
tg32 @ d16384 14.30 16.85 +2.55 +17.8%
ctx_pp @ d32768 1519.62 455.06 -1064.56 -70.1%
ctx_tg @ d32768 12.95 13.26 +0.31 +2.4%
pp2048 @ d32768 953.78 333.40 -620.38 -65.0%
tg32 @ d32768 12.84 12.94 +0.10 +0.8%
ctx_pp @ d65535 1000.55 347.37 -653.18 -65.3%
ctx_tg @ d65535 10.49 8.68 -1.81 -17.2%
pp2048 @ d65535 571.21 227.56 -343.65 -60.2%
tg32 @ d65535 10.38 8.51 -1.87 -18.0%

Now tg is faster with GGUF, except at the longest contexts.

Runtime/Latency Comparison (ttfr-based)
For depth > 0, combined = ctx_pp ttfr + pp2048 ttfr.

depth NVFP4 pp2048 ttfr (ms) GGUF pp2048 ttfr (ms) NVFP4 ctx_pp ttfr (ms) GGUF ctx_pp ttfr (ms) NVFP4 combined (ms) GGUF combined (ms) slowdown
0 720.56 3278.15 - - 720.56 3278.15 4.55x
4096 966.03 3561.20 1474.47 6488.55 2440.50 10049.76 4.12x
8192 1168.69 4066.32 3313.43 13636.33 4482.12 17702.64 3.95x
16384 1506.03 4773.60 8006.99 30381.22 9513.02 35154.82 3.70x
32768 2253.96 6231.85 21669.84 72097.46 23923.80 78329.31 3.27x
65535 3692.10 9089.09 65605.61 188746.40 69297.71 197835.48 2.85x

So only about 3 to 5x slower with better settings.
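Incidentally, the KV cache numbers in the settings change above line up with simple linear scaling in context length (a back-of-envelope check; the small residual is presumably down to the parallel/kv_unified change):

```python
# llama.cpp allocates KV cache proportionally to --ctx-size, so shrinking
# 100000 -> 80000 tokens should shrink the ~24242 MiB allocation by 20%.
old_ctx, new_ctx = 100_000, 80_000
old_kv_mib = 24242

predicted = old_kv_mib * new_ctx / old_ctx
print(f"predicted ~{predicted:.0f} MiB (observed ~19406 MiB)")
```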

That’s mostly my experience using REAP models. I was a little disappointed there were only benchmarks.

Someone did exactly what I hoped they would: a GB10-targeted NVFP4 quant of the larger REAP I mentioned above.

I’m going to grab this and see if eugr’s build with Marlin and the needed variables works too - it would be a good option to compare with the supposedly forthcoming “Atlas engine”.

Resurrecting this thread because the REAPed model mentioned above is solid. The HF user who made it suggests the avarok vllm docker image, but I can confirm that with the proper startup flags and environment variables you get equivalent performance from the community vLLM container.

What really impresses me about this model is its token efficiency. It feels like MiniMax tuned it to think exactly how much it needs in order to enhance performance, and no more. Running it alongside Qwen3.5 and Step-3.5-Flash it feels much more concise - but with similar quality.

Using @eugr’s community docker, I have a total of 141k context available in KV cache, currently set to 128k. Because KV cache and memory are constrained, I limit CUDA graph capture sizes to 1, 2, 3, and 4, since I don’t expect to hit this with more than 4 concurrent queries.

Here is my corrected startup command (I map in a models directory rather than using the HF cache; you may need to tweak slightly):

#!/bin/bash
VLLM_SPARK_EXTRA_DOCKER_ARGS="-v $HOME/models:/models" \
~/containers/spark-vllm-docker/launch-cluster.sh -t vllm-node-tf5 --solo \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  exec vllm serve /models/MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10 \
  --served-model-name MiniMax-M2.5-REAP-172B-NVFP4 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.93 \
  --port 8000 \
  --host 0.0.0.0 \
  --enable-prefix-caching \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --attention-backend flashinfer \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --max-num-seqs 4

llama-benchy results at tg=128. It does run all the way to 128k, but the sweet spot on a single Spark is probably out to around half that.

model test t/s peak t/s ttfr (ms) est_ppt (ms) e2e_ttft (ms)
MiniMax-M2.5-REAP-172B-NVFP4 pp2048 2349.17 ± 62.58 1003.99 ± 23.67 872.43 ± 23.67 1004.06 ± 23.67
MiniMax-M2.5-REAP-172B-NVFP4 tg128 27.60 ± 0.02 28.00 ± 0.00
MiniMax-M2.5-REAP-172B-NVFP4 pp2048 @ d2048 2130.92 ± 9.58 2053.61 ± 8.57 1922.05 ± 8.57 2053.69 ± 8.57
MiniMax-M2.5-REAP-172B-NVFP4 tg128 @ d2048 26.80 ± 0.01 28.00 ± 0.00
MiniMax-M2.5-REAP-172B-NVFP4 pp2048 @ d4096 1997.06 ± 4.34 3208.09 ± 6.69 3076.53 ± 6.69 3208.17 ± 6.69
MiniMax-M2.5-REAP-172B-NVFP4 tg128 @ d4096 25.93 ± 0.02 26.67 ± 0.47
MiniMax-M2.5-REAP-172B-NVFP4 pp2048 @ d8192 1804.06 ± 1.40 5807.65 ± 4.40 5676.09 ± 4.40 5807.73 ± 4.39
MiniMax-M2.5-REAP-172B-NVFP4 tg128 @ d8192 24.56 ± 0.03 25.00 ± 0.00
MiniMax-M2.5-REAP-172B-NVFP4 pp2048 @ d16384 1526.03 ± 0.91 12209.96 ± 7.18 12078.41 ± 7.18 12210.05 ± 7.18
MiniMax-M2.5-REAP-172B-NVFP4 tg128 @ d16384 22.14 ± 0.02 23.00 ± 0.00
MiniMax-M2.5-REAP-172B-NVFP4 pp2048 @ d32768 1179.31 ± 0.96 29653.81 ± 23.93 29522.26 ± 23.93 29653.90 ± 23.93
MiniMax-M2.5-REAP-172B-NVFP4 tg128 @ d32768 18.70 ± 0.02 19.67 ± 0.47
MiniMax-M2.5-REAP-172B-NVFP4 pp2048 @ d65536 813.91 ± 0.75 83167.80 ± 76.38 83036.24 ± 76.38 83167.87 ± 76.39
MiniMax-M2.5-REAP-172B-NVFP4 tg128 @ d65536 14.17 ± 0.02 15.33 ± 0.47
MiniMax-M2.5-REAP-172B-NVFP4 pp2048 @ d128000 514.16 ± 0.62 253044.18 ± 302.69 252933.32 ± 302.69 253044.27 ± 302.69
MiniMax-M2.5-REAP-172B-NVFP4 tg128 @ d128000 9.71 ± 0.02 11.00 ± 0.00

The better way would be to specify --max-num-seqs 4

Thanks! Corrected, that’s simpler and easier to work with.

I want to note that fastsafetensors is deliberately not employed; you can load the model that way, but the memory use seems less efficient. With fastsafetensors the model fails to start due to OOM with the limit around 125000. Without it, as above, I can get the true 131072 context.

I also run totally headless, but with VS Studio attached. So it’s lean on RAM, but not as lean as it could be. If you have a GUI running, you’ll probably have to cut context a bit.

Yes, fastsafetensors is not usable beyond 0.85 of total RAM.
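Assuming the Spark's 128 GB of unified memory (an assumption on my part), that 0.85 ceiling works out to:

```python
# Rough budget: fastsafetensors reportedly OOMs past 0.85 of total RAM.
total_gb = 128          # DGX Spark unified memory (assumed)
usable = 0.85 * total_gb

print(f"~{usable:.1f} GB usable before fastsafetensors loading becomes unsafe")
```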

I have been lurking here following this because I have been using this model on a single DGX Spark with the latest version of the compiled vLLM docker container. My issue is that this model takes over 15 minutes to load. Are you seeing the same times? I have modified my read-ahead cache to no avail. How long does it take you to load the shards with this container? With fastsafetensors on, it still takes about 12 minutes.

It does take a good amount of time; I haven’t clocked it, but that seems high.

First, let me know what your system reports for baseline RAM usage. I run mine headless and lean. If you’re almost OOM, compilation of CUDA graphs may push you over; if so, the system may not instantly hard-crash, but it will take a very long time.

Try reducing context and memory utilization, then monitor the startup log alongside the dashboard or dgxtop.

Thanks for the response. I have been messing around with this and can’t quite work out what is going on, but I have 8 GB of memory left after loading and it runs fine afterwards. I enabled fastsafetensors loading again and it dropped my load times down to 2 minutes, which is what I would expect. I had to lower my context a bit and all is well.

@bernisse Are you using the “saricles/MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10” model? Could you share your optimal setup? joshua.dale.warner says this model is very reliable, and I’d like to test it myself. Thanks!

~/spark-vllm-docker$ cat recipes/minimax-m2.5-REAP.yaml
# Recipe: MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10
# MiniMax M2.5 model with NVFP4 quantization (Marlin backend)
# REAP (Router-weighted Expert Activation Pruning),
# which removes 25% of the least active experts (down to 192 from 256)
# while keeping the same 10B active parameters per token

recipe_version: "1"
name: MiniMax-M2.5-REAP-172B-A10B
description: vLLM MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10

# HuggingFace model to download (optional, for --download-model)
model: saricles/MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10

# Container image to use
container: vllm-node

# Does not require a cluster
cluster_only: false

# No mods required
mods: []

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.92
  max_model_len: 126976

# Environment variables (from model publisher's recommended config)
# Force Marlin backend to avoid FlashInfer CUTLASS TMA crash on SM120
env:
  VLLM_NVFP4_GEMM_BACKEND: "marlin"
  VLLM_TEST_FORCE_FP8_MARLIN: "1"
  VLLM_USE_FLASHINFER_MOE_FP4: "0"
  VLLM_MARLIN_USE_ATOMIC_ADD: "1"

# The vLLM serve command template
command: |
  vllm serve saricles/MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10 \
      --trust-remote-code \
      --port {port} \
      --host {host} \
      --gpu-memory-utilization {gpu_memory_utilization} \
      -tp {tensor_parallel} \
      --distributed-executor-backend ray \
      --served-model-name minimax-m2.5 \
      --max-model-len {max_model_len} \
      --load-format fastsafetensors \
      --kv-cache-dtype fp8 \
      --attention-backend flashinfer \
      --enable-auto-tool-choice \
      --tool-call-parser minimax_m2 \
      --reasoning-parser minimax_m2_append_think \
      --max-num-seqs 4

This one you can apply to @eugr’s vLLM setup on a headless single DGX Spark and start playing with. It’s impressive.