MiniMax 2.5 REAP - NVFP4 on single DGX Spark

Yesterday a REAP version of MiniMax 2.5 showed up, already quantised to NVFP4.

I ran benchy on it:

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| MiniMax-M2.5 | pp2048 | 3342.54 ± 141.85 | | 720.56 ± 26.78 | 613.84 ± 26.78 | 720.64 ± 26.81 |
| MiniMax-M2.5 | tg32 | 16.71 ± 0.24 | 17.00 ± 0.00 | | | |
| MiniMax-M2.5 | ctx_pp @ d4096 | 2994.70 ± 4.09 | | 1474.47 ± 1.87 | 1367.75 ± 1.87 | 1474.53 ± 1.86 |
| MiniMax-M2.5 | ctx_tg @ d4096 | 16.49 ± 0.03 | 17.00 ± 0.00 | | | |
| MiniMax-M2.5 | pp2048 @ d4096 | 2383.55 ± 23.95 | | 966.03 ± 8.69 | 859.31 ± 8.69 | 966.08 ± 8.70 |
| MiniMax-M2.5 | tg32 @ d4096 | 16.27 ± 0.03 | 17.00 ± 0.00 | | | |
| MiniMax-M2.5 | ctx_pp @ d8192 | 2554.64 ± 3.07 | | 3313.43 ± 3.86 | 3206.72 ± 3.86 | 3313.50 ± 3.86 |
| MiniMax-M2.5 | ctx_tg @ d8192 | 15.85 ± 0.02 | 16.33 ± 0.47 | | | |
| MiniMax-M2.5 | pp2048 @ d8192 | 1929.08 ± 34.21 | | 1168.69 ± 18.78 | 1061.98 ± 18.78 | 1168.77 ± 18.78 |
| MiniMax-M2.5 | tg32 @ d8192 | 15.66 ± 0.02 | 16.00 ± 0.00 | | | |
| MiniMax-M2.5 | ctx_pp @ d16384 | 2073.85 ± 1.07 | | 8006.99 ± 4.06 | 7900.28 ± 4.06 | 8007.06 ± 4.06 |
| MiniMax-M2.5 | ctx_tg @ d16384 | 14.55 ± 0.26 | 15.33 ± 0.47 | | | |
| MiniMax-M2.5 | pp2048 @ d16384 | 1463.58 ± 2.90 | | 1506.03 ± 2.77 | 1399.32 ± 2.77 | 1506.10 ± 2.78 |
| MiniMax-M2.5 | tg32 @ d16384 | 14.30 ± 0.20 | 15.00 ± 0.00 | | | |
| MiniMax-M2.5 | ctx_pp @ d32768 | 1519.62 ± 0.70 | | 21669.84 ± 9.96 | 21563.12 ± 9.96 | 21669.91 ± 9.96 |
| MiniMax-M2.5 | ctx_tg @ d32768 | 12.95 ± 0.02 | 13.33 ± 0.47 | | | |
| MiniMax-M2.5 | pp2048 @ d32768 | 953.78 ± 0.49 | | 2253.96 ± 1.10 | 2147.24 ± 1.10 | 2254.04 ± 1.10 |
| MiniMax-M2.5 | tg32 @ d32768 | 12.84 ± 0.02 | 13.00 ± 0.00 | | | |
| MiniMax-M2.5 | ctx_pp @ d65535 | 1000.55 ± 0.63 | | 65605.61 ± 41.25 | 65498.89 ± 41.25 | 65605.67 ± 41.25 |
| MiniMax-M2.5 | ctx_tg @ d65535 | 10.49 ± 0.01 | 11.00 ± 0.00 | | | |
| MiniMax-M2.5 | pp2048 @ d65535 | 571.21 ± 0.27 | | 3692.10 ± 1.68 | 3585.38 ± 1.68 | 3692.19 ± 1.68 |
| MiniMax-M2.5 | tg32 @ d65535 | 10.38 ± 0.02 | 11.00 ± 0.00 | | | |

I had to change the provided vLLM command slightly:

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0
export SAFETENSORS_FAST_GPU=1
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export NCCL_IB_DISABLE=1
export OMP_NUM_THREADS=8

python3 -m vllm.entrypoints.openai.api_server \
    --model lukealonso/MiniMax-M2.5-REAP-139B-A10B-NVFP4 \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name MiniMax-M2.5 \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 64 \
    --max-model-len 131072 \
    --disable-custom-all-reduce \
    --attention-config.use_trtllm_attention=0 \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think
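
Once the server is up, a quick smoke test against the OpenAI-compatible endpoint can look like this (a sketch; it assumes the `--port` and `--served-model-name` values from the command above and a running server):

```shell
# minimal chat completion request against the local vLLM server
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MiniMax-M2.5",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64
      }'
```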

I didn't really have any time the past few weeks, so this is still running on scitrera/dgx-spark-vllm:0.14.0-t5; not sure if recent vLLM versions already fix some of the speed problems. Will test more tomorrow. I would guess 30 t/s, and 20 t/s at long context, should be possible?
Also, the customizations the M2 arch brings look like they could benefit from additional fused kernels. I also want to look at the NVFP4 speedup post and see what that might bring on top.

Have you tried minimax_m2 as the reasoning parser instead of minimax_m2_append_think?

I have to admit that I didn't check any outputs so far. Had to go to a wedding, and will run some tests tomorrow. Happy about any input.

I’m just curious. Lobotomy successful, patient brain-dead? What orientation did the lobo-set have? (short for: lobotomizing dataset)

Generally speaking, it’s a good sign if it can still make coffee afterwards… ;)

I tried this one last night as well as a GLM4.7-Flash MTP NVFP4. The output was gibberish, but it’s probably a skill issue on my end. Maybe I didn’t completely use the correct parameters 😅

This one is a little too strongly REAPed. The creator intended it to work on an RTX PRO 6000, so the target was a 96 GB memory footprint; removing 40% is somewhat too much.

A 20-25% REAP would be better for Spark.

Edit: NVFP4 or AWQ of this one would be a great target: cerebras/MiniMax-M2.5-REAP-172B-A10B · Hugging Face

A version of the 139B has landed as AWQ thanks to cyanwiki / captonn.

I ran llama-benchy against the i1-Q4_K_S from https://hf.tst.eu/model#MiniMax-M2.5-REAP-139B-A10B-GGUF

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| MiniMax-M2.5-REAP-172B-A10B | pp2048 | 334.44 ± 160.18 | | 7516.62 ± 2683.80 | 7403.41 ± 2683.80 | 7517.01 ± 2684.01 |
| MiniMax-M2.5-REAP-172B-A10B | tg32 | 20.36 ± 5.99 | 22.33 ± 5.91 | | | |
| MiniMax-M2.5-REAP-172B-A10B | ctx_pp @ d4096 | 463.87 ± 51.61 | | 9048.49 ± 948.20 | 8935.28 ± 948.20 | 9048.52 ± 948.20 |
| MiniMax-M2.5-REAP-172B-A10B | ctx_tg @ d4096 | 18.24 ± 1.87 | 19.67 ± 2.05 | | | |
| MiniMax-M2.5-REAP-172B-A10B | pp2048 @ d4096 | 454.65 ± 25.72 | | 4632.24 ± 255.56 | 4519.03 ± 255.56 | 4632.28 ± 255.56 |
| MiniMax-M2.5-REAP-172B-A10B | tg32 @ d4096 | 17.67 ± 1.39 | 18.67 ± 1.70 | | | |
| MiniMax-M2.5-REAP-172B-A10B | ctx_pp @ d8192 | 337.76 ± 28.10 | | 24530.93 ± 1967.98 | 24417.72 ± 1967.98 | 24530.98 ± 1967.99 |
| MiniMax-M2.5-REAP-172B-A10B | ctx_tg @ d8192 | 13.12 ± 0.76 | 15.00 ± 0.82 | | | |
| MiniMax-M2.5-REAP-172B-A10B | pp2048 @ d8192 | 347.91 ± 16.29 | | 6012.66 ± 274.07 | 5899.45 ± 274.07 | 6012.69 ± 274.06 |
| MiniMax-M2.5-REAP-172B-A10B | tg32 @ d8192 | 13.22 ± 0.75 | 14.00 ± 0.82 | | | |
| MiniMax-M2.5-REAP-172B-A10B | ctx_pp @ d16384 | 263.29 ± 23.51 | | 62834.47 ± 5526.55 | 62721.26 ± 5526.55 | 62834.53 ± 5526.57 |
| MiniMax-M2.5-REAP-172B-A10B | ctx_tg @ d16384 | 9.61 ± 0.63 | 10.33 ± 0.47 | | | |
| MiniMax-M2.5-REAP-172B-A10B | pp2048 @ d16384 | 258.94 ± 17.57 | | 8058.99 ± 539.89 | 7945.78 ± 539.89 | 8059.04 ± 539.90 |
| MiniMax-M2.5-REAP-172B-A10B | tg32 @ d16384 | 9.28 ± 1.00 | 10.00 ± 0.82 | | | |
| MiniMax-M2.5-REAP-172B-A10B | ctx_pp @ d32768 | 191.41 ± 16.09 | | 172493.19 ± 14154.36 | 172379.98 ± 14154.36 | 172493.91 ± 14155.10 |
| MiniMax-M2.5-REAP-172B-A10B | ctx_tg @ d32768 | 6.66 ± 0.16 | 7.67 ± 0.47 | | | |
| MiniMax-M2.5-REAP-172B-A10B | pp2048 @ d32768 | 184.07 ± 6.47 | | 11253.80 ± 401.75 | 11140.59 ± 401.75 | 11253.86 ± 401.80 |
| MiniMax-M2.5-REAP-172B-A10B | tg32 @ d32768 | 6.76 ± 0.19 | 7.67 ± 0.47 | | | |
| MiniMax-M2.5-REAP-172B-A10B | ctx_pp @ d65535 | 182.88 ± 3.51 | | 358595.32 ± 6962.06 | 358482.10 ± 6962.06 | 358596.26 ± 6963.34 |
| MiniMax-M2.5-REAP-172B-A10B | ctx_tg @ d65535 | 6.58 ± 0.29 | 7.33 ± 0.47 | | | |
| MiniMax-M2.5-REAP-172B-A10B | pp2048 @ d65535 | 177.03 ± 8.37 | | 11707.56 ± 542.00 | 11594.35 ± 542.00 | 11707.59 ± 542.00 |
| MiniMax-M2.5-REAP-172B-A10B | tg32 @ d65535 | 6.51 ± 0.03 | 7.33 ± 0.47 | | | |

Server: llama.cpp version 8123 (f75c4e8bf), built with GCC 13.3.0:

llama.cpp/build/bin/llama-server \
    --host 0.0.0.0 \
    --port 8001 \
    --model ~/models/MiniMax-M2.5-REAP-172B-A10B.i1-IQ4_XS.gguf \
    --alias openai/mradermacher/MiniMax-M2.5-REAP-172B-A10B \
    --no-mmap \
    --flash-attn on \
    --n-gpu-layers 999 \
    --ctx-size 100000 \
    --chat-template-file ~/llama.cpp/models/templates/MiniMax-M2.jinja
Comparison table vs your NVFP4 run:

| test | NVFP4 t/s | our t/s | delta abs | delta % |
|---|---|---|---|---|
| pp2048 | 3342.54 | 334.44 | -3008.10 | -90.0% |
| tg32 | 16.71 | 20.36 | +3.65 | +21.8% |
| ctx_pp @ d4096 | 2994.70 | 463.87 | -2530.83 | -84.5% |
| ctx_tg @ d4096 | 16.49 | 18.24 | +1.75 | +10.6% |
| pp2048 @ d4096 | 2383.55 | 454.65 | -1928.90 | -80.9% |
| tg32 @ d4096 | 16.27 | 17.67 | +1.40 | +8.6% |
| ctx_pp @ d8192 | 2554.64 | 337.76 | -2216.88 | -86.8% |
| ctx_tg @ d8192 | 15.85 | 13.12 | -2.73 | -17.2% |
| pp2048 @ d8192 | 1929.08 | 347.91 | -1581.17 | -82.0% |
| tg32 @ d8192 | 15.66 | 13.22 | -2.44 | -15.6% |
| ctx_pp @ d16384 | 2073.85 | 263.29 | -1810.56 | -87.3% |
| ctx_tg @ d16384 | 14.55 | 9.61 | -4.94 | -34.0% |
| pp2048 @ d16384 | 1463.58 | 258.94 | -1204.64 | -82.3% |
| tg32 @ d16384 | 14.30 | 9.28 | -5.02 | -35.1% |
| ctx_pp @ d32768 | 1519.62 | 191.41 | -1328.21 | -87.4% |
| ctx_tg @ d32768 | 12.95 | 6.66 | -6.29 | -48.6% |
| pp2048 @ d32768 | 953.78 | 184.07 | -769.71 | -80.7% |
| tg32 @ d32768 | 12.84 | 6.76 | -6.08 | -47.4% |
| ctx_pp @ d65535 | 1000.55 | 182.88 | -817.67 | -81.7% |
| ctx_tg @ d65535 | 10.49 | 6.58 | -3.91 | -37.3% |
| pp2048 @ d65535 | 571.21 | 177.03 | -394.18 | -69.0% |
| tg32 @ d65535 | 10.38 | 6.51 | -3.87 | -37.3% |

Takeaways:

  • My run (llama.cpp + GGUF i1-IQ4_XS) is much slower on prefill than the NVFP4+vLLM run: roughly -69% to -90% on pp2048 / ctx_pp.
  • Decode at short depth is good: tg32 and tg32 @ d4096 are actually higher than NVFP4 (+22%, +9%).
  • As context depth increases, GGUF decode drops below NVFP4:
    • around -17% at d8192
    • around -34% to -49% from d16384 to d32768
    • about -37% at d65535.
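
The delta % column is just (GGUF - NVFP4) / NVFP4; e.g. recomputing the pp2048 row from the table above:

```shell
# delta % for pp2048: (GGUF t/s - NVFP4 t/s) / NVFP4 t/s * 100
awk 'BEGIN { printf "%.1f%%\n", (334.44 - 3342.54) / 3342.54 * 100 }'
# prints -90.0%
```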

Then comparing ttfr:

| depth | NVFP4 pp2048 ttfr (ms) | GGUF pp2048 ttfr (ms) | NVFP4 ctx_pp ttfr (ms) | GGUF ctx_pp ttfr (ms) | NVFP4 combined (ms) | GGUF combined (ms) | combined slowdown |
|---|---|---|---|---|---|---|---|
| 0 | 720.56 | 7516.62 | - | - | 720.56 | 7516.62 | 10.43x |
| 4096 | 966.03 | 4632.24 | 1474.47 | 9048.49 | 2440.50 | 13680.73 | 5.61x |
| 8192 | 1168.69 | 6012.66 | 3313.43 | 24530.93 | 4482.12 | 30543.59 | 6.81x |
| 16384 | 1506.03 | 8058.99 | 8006.99 | 62834.47 | 9513.02 | 70893.46 | 7.45x |
| 32768 | 2253.96 | 11253.80 | 21669.84 | 172493.19 | 23923.80 | 183746.99 | 7.68x |
| 65535 | 3692.10 | 11707.56 | 65605.61 | 358595.32 | 69297.71 | 370302.88 | 5.34x |

So generally 5-8x slower across long contexts.
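
Sanity-checking the combined numbers (combined = ctx_pp ttfr + pp2048 ttfr at the same depth), e.g. for d65535:

```shell
# combined ttfr at d65535 for both runs, and the resulting slowdown factor
awk 'BEGIN {
  nvfp4 = 65605.61 + 3692.10     # NVFP4: ctx_pp ttfr + pp2048 ttfr (ms)
  gguf  = 358595.32 + 11707.56   # GGUF:  ctx_pp ttfr + pp2048 ttfr (ms)
  printf "%.2fx\n", gguf / nvfp4
}'
# prints 5.34x
```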

Update: changed settings as below and retested.

  • --ctx-size: 100000 → 80000
  • --parallel: auto/4 → 1
  • --cache-ram: default enabled (8192 MiB) → 0 (disabled)
  • n_slots (effective): 4 → 1
  • kv_unified: true → false (because parallel=1)
  • KV cache allocation: ~24242 MiB → ~19406 MiB
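
Concretely, the changed settings correspond to llama-server flags along these lines (a fragment, not the full command; everything else stays as in the command above, and `--cache-ram` needs a reasonably recent llama.cpp build):

```shell
# retest: smaller context, single slot, prompt cache in RAM disabled
llama.cpp/build/bin/llama-server \
    --ctx-size 80000 \
    --parallel 1 \
    --cache-ram 0 \
    ...                 # remaining flags unchanged from the command above
```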

| test | NVFP4 | old_ctx100k_p4 | new_ctx80k_p1 | new vs NVFP4 | new vs old |
|---|---|---|---|---|---|
| pp2048 | 3342.54 | 334.44 | 642.37 | -80.8% | +92.1% |
| tg32 | 16.71 | 20.36 | 26.15 | +56.5% | +28.5% |
| ctx_pp @ d4096 | 2994.70 | 463.87 | 640.05 | -78.6% | +38.0% |
| ctx_tg @ d4096 | 16.49 | 18.24 | 25.16 | +52.6% | +38.0% |
| pp2048 @ d4096 | 2383.55 | 454.65 | 589.84 | -75.3% | +29.7% |
| tg32 @ d4096 | 16.27 | 17.67 | 24.23 | +48.9% | +37.1% |
| ctx_pp @ d8192 | 2554.64 | 337.76 | 604.70 | -76.3% | +79.0% |
| ctx_tg @ d8192 | 15.85 | 13.12 | 22.33 | +40.9% | +70.1% |
| pp2048 @ d8192 | 1929.08 | 347.91 | 514.93 | -73.3% | +48.0% |
| tg32 @ d8192 | 15.66 | 13.22 | 19.68 | +25.7% | +48.9% |
| ctx_pp @ d16384 | 2073.85 | 263.29 | 540.87 | -73.9% | +105.4% |
| ctx_tg @ d16384 | 14.55 | 9.61 | 17.72 | +21.8% | +84.4% |
| pp2048 @ d16384 | 1463.58 | 258.94 | 437.18 | -70.1% | +68.8% |
| tg32 @ d16384 | 14.30 | 9.28 | 16.85 | +17.8% | +81.5% |
| ctx_pp @ d32768 | 1519.62 | 191.41 | 455.06 | -70.1% | +137.7% |
| ctx_tg @ d32768 | 12.95 | 6.66 | 13.26 | +2.4% | +99.2% |
| pp2048 @ d32768 | 953.78 | 184.07 | 333.40 | -65.0% | +81.1% |
| tg32 @ d32768 | 12.84 | 6.76 | 12.94 | +0.8% | +91.5% |
| ctx_pp @ d65535 | 1000.55 | 182.88 | 347.37 | -65.3% | +89.9% |
| ctx_tg @ d65535 | 10.49 | 6.58 | 8.68 | -17.2% | +32.0% |
| pp2048 @ d65535 | 571.21 | 177.03 | 227.56 | -60.2% | +28.5% |
| tg32 @ d65535 | 10.38 | 6.51 | 8.51 | -18.0% | +30.8% |

New results vs NVFP4:

Throughput Comparison (t/s)

| test | NVFP4 t/s | new t/s | delta abs | delta % |
|---|---|---|---|---|
| pp2048 | 3342.54 | 642.37 | -2700.17 | -80.8% |
| tg32 | 16.71 | 26.15 | +9.44 | +56.5% |
| ctx_pp @ d4096 | 2994.70 | 640.05 | -2354.65 | -78.6% |
| ctx_tg @ d4096 | 16.49 | 25.16 | +8.67 | +52.6% |
| pp2048 @ d4096 | 2383.55 | 589.84 | -1793.71 | -75.3% |
| tg32 @ d4096 | 16.27 | 24.23 | +7.96 | +48.9% |
| ctx_pp @ d8192 | 2554.64 | 604.70 | -1949.94 | -76.3% |
| ctx_tg @ d8192 | 15.85 | 22.33 | +6.48 | +40.9% |
| pp2048 @ d8192 | 1929.08 | 514.93 | -1414.15 | -73.3% |
| tg32 @ d8192 | 15.66 | 19.68 | +4.02 | +25.7% |
| ctx_pp @ d16384 | 2073.85 | 540.87 | -1532.98 | -73.9% |
| ctx_tg @ d16384 | 14.55 | 17.72 | +3.17 | +21.8% |
| pp2048 @ d16384 | 1463.58 | 437.18 | -1026.40 | -70.1% |
| tg32 @ d16384 | 14.30 | 16.85 | +2.55 | +17.8% |
| ctx_pp @ d32768 | 1519.62 | 455.06 | -1064.56 | -70.1% |
| ctx_tg @ d32768 | 12.95 | 13.26 | +0.31 | +2.4% |
| pp2048 @ d32768 | 953.78 | 333.40 | -620.38 | -65.0% |
| tg32 @ d32768 | 12.84 | 12.94 | +0.10 | +0.8% |
| ctx_pp @ d65535 | 1000.55 | 347.37 | -653.18 | -65.3% |
| ctx_tg @ d65535 | 10.49 | 8.68 | -1.81 | -17.2% |
| pp2048 @ d65535 | 571.21 | 227.56 | -343.65 | -60.2% |
| tg32 @ d65535 | 10.38 | 8.51 | -1.87 | -18.0% |

Now, tg is faster in GGUF, except at longest contexts.

Runtime/Latency Comparison (ttfr-based)
For depth > 0, combined = ctx_pp ttfr + pp2048 ttfr.

| depth | NVFP4 pp2048 ttfr (ms) | GGUF pp2048 ttfr (ms) | NVFP4 ctx_pp ttfr (ms) | GGUF ctx_pp ttfr (ms) | NVFP4 combined (ms) | GGUF combined (ms) | slowdown |
|---|---|---|---|---|---|---|---|
| 0 | 720.56 | 3278.15 | - | - | 720.56 | 3278.15 | 4.55x |
| 4096 | 966.03 | 3561.20 | 1474.47 | 6488.55 | 2440.50 | 10049.76 | 4.12x |
| 8192 | 1168.69 | 4066.32 | 3313.43 | 13636.33 | 4482.12 | 17702.64 | 3.95x |
| 16384 | 1506.03 | 4773.60 | 8006.99 | 30381.22 | 9513.02 | 35154.82 | 3.70x |
| 32768 | 2253.96 | 6231.85 | 21669.84 | 72097.46 | 23923.80 | 78329.31 | 3.27x |
| 65535 | 3692.10 | 9089.09 | 65605.61 | 188746.40 | 69297.71 | 197835.48 | 2.85x |

So only about 3 to 5x slower with better settings.

That’s mostly my experience using REAP models. I was a little disappointed that there were only benchmarks.

Someone did exactly as I hoped that they would - a GB10 board targeted NVFP4 quant of the larger REAP I mentioned above.

I’m going to grab this and see if eugr’s build with Marlin and the needed variables works too. It would also be a good option to compare with the supposedly forthcoming “Atlas engine”.