MiniMax M2.7 NFVP4 Recipe & Benchmarks

serapis · April 12, 2026, 6:49pm

I had a chance to play with lukealonso/MiniMax-M2.7-NVFP4 and wanted to share my very first results (Dual Node Setup - 2x Asus Ascent GX10). I let vLLM calculate context and was able to get 196608 – not too terrible.

Benchmarks:

| model        |            test |              t/s |     peak t/s |         ttfr (ms) |      est_ppt (ms) |     e2e_ttft (ms) |
|:-------------|----------------:|-----------------:|-------------:|------------------:|------------------:|------------------:|
| MiniMax-M2.7 |          pp2048 | 2074.25 ± 223.23 |              |   865.36 ± 108.23 |   863.94 ± 108.23 |   865.42 ± 108.22 |
| MiniMax-M2.7 |           tg128 |     24.30 ± 0.02 | 25.00 ± 0.00 |                   |                   |                   |
| MiniMax-M2.7 |  pp2048 @ d4096 | 2249.52 ± 305.90 |              |  2377.95 ± 302.17 |  2376.52 ± 302.17 |  2378.01 ± 302.18 |
| MiniMax-M2.7 |   tg128 @ d4096 |     23.75 ± 0.08 | 24.33 ± 0.47 |                   |                   |                   |
| MiniMax-M2.7 |  pp2048 @ d8192 | 2146.48 ± 526.28 |              | 4515.71 ± 1247.48 | 4514.29 ± 1247.48 | 4515.80 ± 1247.48 |
| MiniMax-M2.7 |   tg128 @ d8192 |     22.92 ± 0.14 | 24.00 ± 0.82 |                   |                   |                   |
| MiniMax-M2.7 | pp2048 @ d16384 |   2471.71 ± 7.59 |              |   6474.06 ± 67.79 |   6472.63 ± 67.79 |   6474.11 ± 67.78 |
| MiniMax-M2.7 |  tg128 @ d16384 |     21.78 ± 0.41 | 23.00 ± 0.00 |                   |                   |                   |

llama-benchy (0.3.5)
date: 2026-04-12 20:43:17 | latency mode: api

My config:

vllm serve lukealonso/MiniMax-M2.7-NVFP4 \
    --host 0.0.0.0 \
    --port 8888 \
    --max-num-seqs 4 \
    --max-num-batched-tokens 8192 \
    --gpu-memory-utilization 0.92 \
    --mamba_ssm_cache_dtype float32 \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --served-model-name MiniMax-M2.7 \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --kv-cache-dtype fp8 \
    --quantization modelopt_fp4 \
    --moe-backend cutlass \
    --disable-custom-all-reduce \
    --dtype auto \
    --trust-remote-code \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray

Environment variables used:

VLLM_USE_FLASHINFER_MOE_FP4=0
VLLM_NVFP4_GEMM_BACKEND=cutlass
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
OMP_NUM_THREADS=8

I run this via @eugr’s spark-vllm-docker with the TF5 container.

I’ll play around with other config options over the coming days but am pretty happy with the result for a first try. Very keen to see what other people achieve, too!

Cheers

paxren2020 · April 12, 2026, 6:53pm

Could you please test the llama-bench model with the following parameters:

–pp 1000 --tg 128 --depth 1000 10000 15000 20000 30000 40000 70000 100000 --concurrency 1 2

vedcsolution · April 12, 2026, 6:59pm

@joshua.dale.warner (https://forums.developer.nvidia.com/u/joshua.dale.warner)

mrtime

3h

Everyone considering this model needs to very, very carefully look at the modified license MM2.7 was released under.

MM2.7 is NOT open licensed. It is now fundamentally under a non-commercial license. If you make any money with it, or using derivatives of its output, you are opened up to be sued into oblivion.

Claiming this is “open source” is a travesty and wildly dishonest by MiniMax.

Worse, in the repo commit history it had a proper license and then they changed it seemingly mere moments before release.

It would be interesting if the community backwards engineered their claimed improvements from and using MM2.5 - because apparently self-improvement was a huge part of the evolution to MM2.7.

They used what sounds like Autoresearch or a similar harness. We could basically fork the model family from where it was open licensed. I don’t know how it’s controlled, but it takes away any desire to try it. Maybe they’ll release an open-source version, I don’t know.

ekkis · April 12, 2026, 7:15pm

I’ve been playing with it as well, reposting my post from the 2.5 thread below. Your numbers look better than mine so I’ll update my recipe with your settings. Are you running with or without ray?

Repost below:

The M2.7 release probably needs it’s own thread, but I’ll post this here for now. I’m testing the lukealonso/MiniMax-M2.7-NVFP4 quant on my 2x Asus Gx10 cluster and initial results are promising. I tried first with Cutlass backend:

recipe_version: "1"
name: MiniMax-M2.7-NVFP4-CUTLASS
description: vLLM serving MiniMax-M2.7-NVFP4 using stable CUTLASS NvFP4 backend on GB10/SM121

model: lukealonso/MiniMax-M2.7-NVFP4
container: vllm-node-tf5
cluster_only: false
mods: []

defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.80
  max_model_len: 196000
  max_num_seqs: 5

env:
  VLLM_NVFP4_GEMM_BACKEND: "cutlass"
  VLLM_USE_FLASHINFER_MOE_FP4: 0
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: 1

command: |
  vllm serve lukealonso/MiniMax-M2.7-NVFP4 \
      --trust-remote-code \
      --kv-cache-dtype fp8 \
      --moe-backend cutlass \
      --attention-backend TRITON_ATTN \
      --port {port} \
      --host {host} \
      --gpu-memory-utilization {gpu_memory_utilization} \
      -tp {tensor_parallel} \
      --distributed-executor-backend ray \
      --max-model-len {max_model_len} \
      --max-num-seqs {max_num_seqs} \
      --load-format fastsafetensors \
      --enable-auto-tool-choice \
      --tool-call-parser minimax_m2 \
      --reasoning-parser minimax_m2_append_think

A quick benchmark gave this:

pp=2048, tg=32, depths 0/4096/8192/16384/32768, runs=3, latency-mode=generation

CUTLASS
Latency: 311.2 ms

| Depth | PP tok/s | TG tok/s | E2E TTFT |
| ----- | -------- | -------- | -------- |
| 0     | 2463     | 17.75    | 1.09s    |
| 4096  | 1848     | 11.91    | 3.44s    |
| 8192  | 1601     | 11.71    | 6.32s    |
| 16384 | 1257     | 10.49    | 14.10s   |
| 32768 | 864      | 9.48     | 38.39s

I then switched the recipe to use flashinfer-cutlass instead:

# Recipe: MiniMax-M2.7-NVFP4 (FlashInfer test)
# Duplicated from the CUTLASS-stable MiniMax M2.7 recipe for A/B testing.
# Keep NVFP4 GEMM on CUTLASS, but switch attention + MoE FP4 path to FlashInfer.
# Also explicitly disable TRT-LLM attention path per Eugr's SM100-only note.

recipe_version: "1"
name: MiniMax-M2.7-NVFP4-FlashInfer-Test
description: Experimental vLLM serving MiniMax-M2.7-NVFP4 with FlashInfer attention and FlashInfer MoE FP4 test path on GB10/SM121

model: lukealonso/MiniMax-M2.7-NVFP4
container: vllm-node-tf5
cluster_only: false
mods: []

defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.80
  max_model_len: 196000
  max_num_seqs: 5

env:
  VLLM_NVFP4_GEMM_BACKEND: "flashinfer-cutlass"
  VLLM_USE_FLASHINFER_MOE_FP4: 1
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: 1

command: |
  vllm serve lukealonso/MiniMax-M2.7-NVFP4 \
      --trust-remote-code \
      --kv-cache-dtype fp8 \
      --attention-backend flashinfer \
      --attention-config.use_trtllm_attention=0 \
      --port {port} \
      --host {host} \
      --gpu-memory-utilization {gpu_memory_utilization} \
      -tp {tensor_parallel} \
      --max-num-seqs {max_num_seqs} \
      --distributed-executor-backend ray \
      --max-model-len {max_model_len} \
      --load-format fastsafetensors \
      --enable-auto-tool-choice \
      --tool-call-parser minimax_m2 \
      --reasoning-parser minimax_m2_append_think

Benchy results were now much better:

FLASHINFER
Latency: 166.7 ms

| Depth | PP tok/s | TG tok/s | E2E TTFT |
| ----- | -------- | -------- | -------- |
| 0     | 2904     | 20.87    | 0.83s    |
| 4096  | 2478     | 20.44    | 2.48s    |
| 8192  | 2295     | 19.99    | 4.38s    |
| 16384 | 2046     | 19.13    | 8.68s    |
| 32768 | 1739     | 17.84    | 19.01s   |

I also ran ToolCall15 which it aced, unlike the 397b which scored 27/30:

| Backend    | Points | Final score | Rating          | Non-pass cases |
| ---------- | ------ | ----------- | --------------- | -------------- |
| CUTLASS    | 30/30  | 100         | ★★★★★ Excellent | 0              |
| FLASHINFER | 30/30  | 100         | ★★★★★ Excellent | 0              |

| Category            | Score |
| ------------------- | ----- |
| Tool Selection      | 100%  |
| Parameter Precision | 100%  |
| Multi-Step Chains   | 100%  |
| Restraint & Refusal | 100%  |
| Error Recovery      | 100%  |

For reference this is what I got on the 397b:

| Category            | Score |
| ------------------- | ----- |
| Tool Selection      | 100%  |
| Parameter Precision | 100%  |
| Multi-Step Chains   | 100%  |
| Restraint & Refusal | 83%   |
| Error Recovery      | 67%   |

Non-pass notes
TC-11: used the calculator unnecessarily for simple math
TC-15: did not preserve the exact searched value across tool calls

serapis · April 12, 2026, 7:37pm

Thanks for sharing! I’ll spend more time optimizing things tomorrow. We may still get a little bit more squeezed out of this model. I hope I can get 262k context to work, too.

aostang · April 12, 2026, 11:41pm

MiniMax advertises M2.7 as having a context window of 200K. I would be cautious about anything more than that for anything that actually matters.

aostang · April 12, 2026, 11:56pm

This is what I’m seeing across 4 Sparks with NVFP4.

(APIServer pid=900) INFO 04-12 23:53:17 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 33.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=900) INFO 04-12 23:53:27 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 33.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=900) INFO 04-12 23:53:37 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 33.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=900) INFO 04-12 23:53:47 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 33.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=900) INFO 04-12 23:53:57 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 33.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%

I see Unsloth posted an FP8. I’m going to give that one a spin now.

mrtime · April 13, 2026, 1:04am

aostang · April 13, 2026, 1:35am

And this is what I’m seeing running Unsloth’s FP8 across 4 Sparks.

(APIServer pid=905) INFO 04-13 01:26:11 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=905) INFO 04-13 01:26:21 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%

Kind of surprised. I’m actually seeing not only no degradation in throughput, but an increase in throughput, even though I switched from NVFP4 to the full FP8 model. And when I have a cache hit:

(APIServer pid=905) INFO 04-13 01:33:31 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 53.6 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.9%, Prefix cache hit rate: 44.1%
(APIServer pid=905) INFO 04-13 01:33:41 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 53.8 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.9%, Prefix cache hit rate: 44.1%

eugr · April 13, 2026, 2:21am

FYI, Flashinfer-cutlass (the default NVFP4 backend) seems to be stable enough now that I’m planning to switch all my NVFP4 recipes over to it from VLLM_CUTLASS. The autotuner exceptions are now fixed, and some minor optimizations have been merged into flashinfer recently.

ekkis · April 13, 2026, 4:17am

I pulled the latest wheels of spark-vllm-docker yesterday and reranmy benchmarks as well as some additional tests, and flashinfer-cutlass got a nice boost now that autotune runs without errors: earlier I was using 0.19.1rc1.dev71+gdd9342e6b.d20260410, now I’m on 0.19.1rc1.dev219+g72ff142c3.d20260412.

I ran a quick test matrix, forum-cutlass is @serapis settings posted above. Bear in mind my headnode is limited to 2150mhz so it doesn’t freeze.

cell	latency ms	avg PP	avg TG	32k TTFT s	read
forum-cutlass, Ray	122.13	2312	22.04	17.55	baseline
forum-cutlass, no-Ray	140.59	2364	22.99	17.50	slower, tiny throughput bump
flashinfer-cutlass + throughput, Ray	111.29	3003	22.98	14.44	best latency
flashinfer-cutlass + throughput, no-Ray	122.52	3065	24.12	14.30	best overall profile
flashinfer-trtllm + latency, Ray	114.49	3033	22.74	14.50	close, not better
flashinfer-trtllm + latency, no-Ray	132.75	3078	23.90	14.43	close, not better

Results for the winner:

depth	PP tok/s	TG tok/s	TTFT s
0	3368	25.81	0.691
4096	3647	25.19	1.699
8192	3174	24.57	3.152
16384	2825	23.50	6.279
32768	2309	21.51	14.304

And full settings for the recipe:

# Recipe: MiniMax-M2.7-NVFP4 (FlashInfer test)
# Duplicated from the CUTLASS-stable MiniMax M2.7 recipe for A/B testing.
# Keep NVFP4 GEMM on CUTLASS, but switch attention + MoE FP4 path to FlashInfer.
# Also explicitly disable TRT-LLM attention path per Eugr's SM100-only note.

recipe_version: "1"
name: MiniMax-M2.7-NVFP4-FlashInfer-Test
description: Experimental vLLM serving MiniMax-M2.7-NVFP4 with FlashInfer attention and FlashInfer MoE FP4 test path on GB10/SM121

model: lukealonso/MiniMax-M2.7-NVFP4
container: vllm-node-tf5
cluster_only: false
mods: []

defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.8057
  max_model_len: 225000
  max_num_seqs: 5

env:
  VLLM_NVFP4_GEMM_BACKEND: flashinfer-cutlass
  VLLM_USE_FLASHINFER_MOE_FP4: 1
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: 1
  OMP_NUM_THREADS: 8
  VLLM_FLOAT32_MATMUL_PRECISION: high
  VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS: 1
  VLLM_FLASHINFER_MOE_BACKEND: throughput

command: |
  vllm serve lukealonso/MiniMax-M2.7-NVFP4 \
      --trust-remote-code \
      --kv-cache-dtype fp8 \
      --dtype auto \
      --quantization modelopt_fp4 \
      --attention-backend flashinfer \
      --max-num-batched-tokens 8192 \
      --mamba_ssm_cache_dtype float32 \
      --disable-custom-all-reduce \
      --port {port} \
      --host {host} \
      --gpu-memory-utilization {gpu_memory_utilization} \
      -tp {tensor_parallel} \
      --max-num-seqs {max_num_seqs} \
      --distributed-executor-backend ray \
      --max-model-len {max_model_len} \
      --load-format fastsafetensors \
      --enable-auto-tool-choice \
      --tool-call-parser minimax_m2 \
      --reasoning-parser minimax_m2_append_think

First impressions in Claude Code are good, it seems much more thorough than 397b and both ran and fixed some tests in my code that 397b had been ignoring ever since it wrote them. Prefix cache hit rate hovering around 95% so responses are very snappy.

voktolom · April 13, 2026, 8:18am

int4-AutoRound turns out that this format is inferior to the NVFP4 format?
For the speed + quality + stability ratio, NVFP4 is currently the best option for DGX Spark?

In the context, important updates have been made to the NVFP4 format!

cosinus · April 13, 2026, 9:10am

Wild reactions across the net:

From the makers:

More important for us in here I think:

…so as long as you pile up your Sparks at home and use it, you should be fine. 😅

miken · April 13, 2026, 9:17am

I just tested MiniMax M2.7 on my dual Spark setup using:

  Image: vllm-node (eugr spark-vllm-docker, --rebuild-vllm --tf5)                                                                                                            
  vLLM: 0.19.1rc1.dev221                                                                                                                                                     
  Hardware: 2x DGX Spark, CX7 200Gbps direct connect                                                                                                      
                                                                                                                                                                             
  Environment:                                                                                                                                                               
    NCCL_IB_DISABLE=0                                                                                                                                                        
    NCCL_P2P_DISABLE=1                                      
    VLLM_USE_FLASHINFER_MOE_FP16=1
    VLLM_USE_DEEP_GEMM=0                                                                                                                                                     
    VLLM_USE_FLASHINFER_SAMPLER=0
    OMP_NUM_THREADS=4                                                                                                                                                        
                                                            
  vllm serve cyankiwi/MiniMax-M2.7-AWQ-4bit \                                                                                                                                
      --trust-remote-code \
      --port 8000 \                                                                                                                                                          
      --host 0.0.0.0 \                                      
      --gpu-memory-utilization 0.85 \
      -tp 2 \                                                                                                                                                                
      --distributed-executor-backend ray \
      --max-model-len 196608 \                                                                                                                                               
      --load-format fastsafetensors \                       
      --enable-auto-tool-choice \
      --tool-call-parser minimax_m2 \                                                                                                                                        
      --reasoning-parser minimax_m2 \
      --attention-backend FLASHINFER

Here are the Llama Benchy results:

  MiniMax-M2.7 AWQ 4-bit on 2x DGX Spark (TP=2, CX7 200Gbps)   
  vLLM 0.19.1rc1.dev221 (eugr spark-vllm-docker, --rebuild-vllm --tf5)                                                                                                       
  gpu-memory-utilization 0.85, max-model-len 196608, FlashInfer attention, fastsafetensors                                                                                   
                                                                                                                                                                             
  | model                          |             test |             t/s |     peak t/s |          ttfr (ms) |       est_ppt (ms) |      e2e_ttft (ms) |                      
  |:-------------------------------|-----------------:|----------------:|-------------:|-------------------:|-------------------:|-------------------:|                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |           pp2048 |  2900.93 ± 3.91 |              |      707.52 ± 0.95 |      705.98 ± 0.95 |      707.59 ± 0.94 |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |            tg128 |    38.32 ± 0.03 | 39.00 ± 0.00 |                    |                    |                    |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |   pp2048 @ d4096 |  2788.71 ± 1.79 |              |     2204.70 ± 1.41 |     2203.17 ± 1.41 |     2204.78 ± 1.41 |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |    tg128 @ d4096 |    34.89 ± 0.40 | 36.00 ± 0.00 |                    |                    |                    |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |   pp2048 @ d8192 |  2656.71 ± 6.06 |              |     3855.83 ± 8.96 |     3854.29 ± 8.96 |     3855.92 ± 8.94 |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |    tg128 @ d8192 |    32.83 ± 0.16 | 33.67 ± 0.47 |                    |                    |                    |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |  pp2048 @ d16384 | 2389.09 ± 46.57 |              |   7719.58 ± 152.55 |   7718.05 ± 152.55 |   7719.66 ± 152.56 |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |   tg128 @ d16384 |    28.30 ± 0.23 | 29.00 ± 0.00 |                    |                    |                    |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |  pp2048 @ d32768 | 2044.32 ± 13.42 |              |  17032.86 ± 112.31 |  17031.32 ± 112.31 |  17032.93 ± 112.31 |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |   tg128 @ d32768 |    22.68 ± 0.14 | 23.67 ± 0.47 |                    |                    |                    |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |  pp2048 @ d65536 |  1570.44 ± 2.67 |              |   43036.74 ± 73.18 |   43035.20 ± 73.18 |   43036.80 ± 73.18 |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |   tg128 @ d65536 |    16.14 ± 0.06 | 17.67 ± 0.47 |                    |                    |                    |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit | pp2048 @ d131072 |  1064.93 ± 3.69 |              | 125006.15 ± 434.25 | 125004.61 ± 434.25 | 125006.24 ± 434.25 |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |  tg128 @ d131072 |    10.28 ± 0.07 | 12.00 ± 0.00 |                    |                    |                    |                      
                                                                                                                                                                             
  llama-benchy (0.3.5)

I also ran a small suite of OpenClaw usage tests I created, and it beat MiniMax 2.5, Qwen3.5 122B, Qwen3.5 397B and even Haiku 4.5.

I’m really pleased with it, although I am hoping the int4 Autoround will be a bit faster once it’s released.

So just wanted to say thanks to Eugr and the rest of the community for all their hard work!

serapis · April 13, 2026, 9:57am

This is for the nvfp4 one based on @ekkis with subtle tweaks:

| model        |             test |             t/s |     peak t/s |          ttfr (ms) |       est_ppt (ms) |      e2e_ttft (ms) |
|:-------------|-----------------:|----------------:|-------------:|-------------------:|-------------------:|-------------------:|
| MiniMax-M2.7 |           pp2048 | 3146.24 ± 16.60 |              |      581.68 ± 5.88 |      580.03 ± 5.88 |      581.76 ± 5.87 |
| MiniMax-M2.7 |            tg128 |    25.69 ± 0.08 | 26.00 ± 0.00 |                    |                    |                    |
| MiniMax-M2.7 |   pp2048 @ d4096 |  4136.09 ± 9.54 |              |    1304.96 ± 17.19 |    1303.31 ± 17.19 |    1305.02 ± 17.18 |
| MiniMax-M2.7 |    tg128 @ d4096 |    25.01 ± 0.05 | 26.00 ± 0.00 |                    |                    |                    |
| MiniMax-M2.7 |   pp2048 @ d8192 | 3532.93 ± 30.21 |              |     2571.29 ± 2.47 |     2569.64 ± 2.47 |     2571.36 ± 2.47 |
| MiniMax-M2.7 |    tg128 @ d8192 |    24.48 ± 0.05 | 25.00 ± 0.00 |                    |                    |                    |
| MiniMax-M2.7 |  pp2048 @ d16384 | 3401.70 ± 14.73 |              |    4731.97 ± 75.26 |    4730.32 ± 75.26 |    4732.05 ± 75.26 |
| MiniMax-M2.7 |   tg128 @ d16384 |    23.37 ± 0.08 | 24.00 ± 0.00 |                    |                    |                    |
| MiniMax-M2.7 |  pp2048 @ d32768 | 2638.56 ± 13.09 |              |  11539.58 ± 191.11 |  11537.93 ± 191.11 |  11539.64 ± 191.12 |
| MiniMax-M2.7 |   tg128 @ d32768 |    21.55 ± 0.06 | 22.00 ± 0.00 |                    |                    |                    |
| MiniMax-M2.7 |  pp2048 @ d65536 | 1815.83 ± 14.16 |              |  32460.81 ± 377.12 |  32459.16 ± 377.12 |  32460.87 ± 377.13 |
| MiniMax-M2.7 |   tg128 @ d65536 |    18.64 ± 0.11 | 20.00 ± 0.00 |                    |                    |                    |
| MiniMax-M2.7 | pp2048 @ d131072 |  1117.27 ± 6.17 |              | 104292.66 ± 704.48 | 104291.01 ± 704.48 | 104292.75 ± 704.52 |
| MiniMax-M2.7 |  tg128 @ d131072 |    11.26 ± 2.79 | 13.67 ± 1.70 |                    |                    |                    |

And here is the AWQ one – that seems to be a winner indeed:

| model        |             test |             t/s |     peak t/s |          ttfr (ms) |       est_ppt (ms) |      e2e_ttft (ms) |
|:-------------|-----------------:|----------------:|-------------:|-------------------:|-------------------:|-------------------:|
| MiniMax-M2.7 |           pp2048 | 2898.23 ± 28.92 |              |     625.28 ± 10.79 |     623.73 ± 10.79 |     625.32 ± 10.79 |
| MiniMax-M2.7 |            tg128 |    39.39 ± 0.06 | 40.00 ± 0.00 |                    |                    |                    |
| MiniMax-M2.7 |   pp2048 @ d4096 |  3570.38 ± 7.00 |              |    1494.05 ± 29.83 |    1492.50 ± 29.83 |    1494.11 ± 29.82 |
| MiniMax-M2.7 |    tg128 @ d4096 |    37.95 ± 0.10 | 39.00 ± 0.00 |                    |                    |                    |
| MiniMax-M2.7 |   pp2048 @ d8192 | 3157.62 ± 23.87 |              |    2858.67 ± 14.45 |    2857.12 ± 14.45 |    2858.73 ± 14.45 |
| MiniMax-M2.7 |    tg128 @ d8192 |    37.04 ± 0.30 | 37.67 ± 0.47 |                    |                    |                    |
| MiniMax-M2.7 |  pp2048 @ d16384 |  2995.08 ± 4.62 |              |    5259.94 ± 26.68 |    5258.39 ± 26.68 |    5259.99 ± 26.68 |
| MiniMax-M2.7 |   tg128 @ d16384 |    34.35 ± 0.13 | 35.00 ± 0.00 |                    |                    |                    |
| MiniMax-M2.7 |  pp2048 @ d32768 |  2380.92 ± 4.14 |              |   12728.98 ± 83.76 |   12727.43 ± 83.76 |   12729.05 ± 83.75 |
| MiniMax-M2.7 |   tg128 @ d32768 |    30.58 ± 0.09 | 31.33 ± 0.47 |                    |                    |                    |
| MiniMax-M2.7 |  pp2048 @ d65536 |  1696.77 ± 4.17 |              |  34787.96 ± 205.02 |  34786.41 ± 205.02 |  34788.04 ± 205.02 |
| MiniMax-M2.7 |   tg128 @ d65536 |    25.15 ± 0.14 | 27.00 ± 0.00 |                    |                    |                    |
| MiniMax-M2.7 | pp2048 @ d131072 |  1080.73 ± 2.02 |              | 107408.95 ± 586.57 | 107407.40 ± 586.57 | 107409.01 ± 586.57 |
| MiniMax-M2.7 |  tg128 @ d131072 |    18.47 ± 0.13 | 20.67 ± 0.47 |                    |                    |                    |

llama-benchy (0.3.5)
date: 2026-04-13 11:45:25 | latency mode: api

Recipe (slightly changed in comparison to @miken):

vllm serve cyankiwi/MiniMax-M2.7-AWQ-4bit \
    --host 0.0.0.0 \
    --port 8888 \
    --max-model-len 196608 \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 4 \
    --max-num-batched-tokens 8192 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --enable-auto-tool-choice \
    --served-model-name MiniMax-M2.7 \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2 \
    --attention-backend flashinfer \
    --override-generation-config "{\"top_k\": 40, \"top_p\": 0.95, \"temperature\": 1.0, \"min_p\": 0.01}" \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --disable-custom-all-reduce \
    --trust-remote-code \
    --tensor-parallel-size 2 \
    --load-format fastsafetensors \
    --distributed-executor-backend ray

Environment variables:

VLLM_USE_FLASHINFER_MOE_FP16=1
VLLM_USE_DEEP_GEMM=0
VLLM_USE_FLASHINFER_SAMPLER=0
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
VLLM_FLOAT32_MATMUL_PRECISION=high
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
OMP_NUM_THREADS=8

vedcsolution · April 13, 2026, 10:03am

Thanks for the update to the model, I was waiting for it, thanks for the new recipes

co-le · April 13, 2026, 12:02pm

2x Asus Ascent GX10, performance very similar to M2.5 (which makes sense, basically same model, same size).

model	test	t/s	peak t/s	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
cyankiwi/MiniMax-M2.7-AWQ-4bit	pp2048	3121.55 ± 32.45		779.28 ± 6.82	656.16 ± 6.82	779.35 ± 6.82
cyankiwi/MiniMax-M2.7-AWQ-4bit	tg32	41.60 ± 0.06	42.94 ± 0.07
cyankiwi/MiniMax-M2.7-AWQ-4bit	pp2048 @ d4096	2642.58 ± 6.81		2448.14 ± 5.98	2325.02 ± 5.98	2448.21 ± 5.98
cyankiwi/MiniMax-M2.7-AWQ-4bit	tg32 @ d4096	39.73 ± 0.04	41.02 ± 0.04
cyankiwi/MiniMax-M2.7-AWQ-4bit	pp2048 @ d8192	2456.91 ± 3.91		4290.97 ± 6.63	4167.85 ± 6.63	4291.04 ± 6.63
cyankiwi/MiniMax-M2.7-AWQ-4bit	tg32 @ d8192	38.56 ± 0.06	39.81 ± 0.06
cyankiwi/MiniMax-M2.7-AWQ-4bit	pp2048 @ d16384	2196.05 ± 1.09		8516.37 ± 4.16	8393.25 ± 4.16	8516.44 ± 4.16
cyankiwi/MiniMax-M2.7-AWQ-4bit	tg32 @ d16384	35.67 ± 0.04	36.83 ± 0.04
cyankiwi/MiniMax-M2.7-AWQ-4bit	pp2048 @ d32768	1815.85 ± 2.53		19296.54 ± 26.75	19173.42 ± 26.75	19296.61 ± 26.74
cyankiwi/MiniMax-M2.7-AWQ-4bit	tg32 @ d32768	31.35 ± 0.17	32.36 ± 0.17
cyankiwi/MiniMax-M2.7-AWQ-4bit	pp2048 @ d100000	1047.93 ± 1.09		97504.06 ± 101.52	97380.94 ± 101.52	97504.14 ± 101.53
cyankiwi/MiniMax-M2.7-AWQ-4bit	tg32 @ d100000	21.20 ± 0.05	22.00 ± 0.00

llama-benchy (0.3.5)
date: 2026-04-13 14:54:14 | latency mode: generation

To make it work I just updated the 2.5 to 2.7 in the recipe. Here is my version for max context:

spark-vllm-docker/recipes/minimax-m2.7-awq.yaml

# Recipe: MiniMax-M2.7-AWQ
# MiniMax M2.7 model with AWQ quantization

recipe_version: "1"
name: MiniMax-M2.7-AWQ
description: vLLM serving MiniMax-M2.7-AWQ with Ray distributed backend

# HuggingFace model to download (optional, for --download-model)
model: cyankiwi/MiniMax-M2.7-AWQ-4bit

# Container image to use
container: vllm-node

# Can only be run in a cluster
cluster_only: true

# No mods required
mods: []

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.9

# Environment variables
env: {}

# The vLLM serve command template
command: |
  vllm serve cyankiwi/MiniMax-M2.7-AWQ-4bit \
      --trust-remote-code \
      --port {port} \
      --host {host} \
      --gpu-memory-utilization {gpu_memory_utilization} \
      -tp {tensor_parallel} \
      --distributed-executor-backend ray \
      --load-format fastsafetensors \
      --enable-auto-tool-choice \
      --tool-call-parser minimax_m2 \
      --reasoning-parser minimax_m2 \
      --kv-cache-dtype fp8_e4m3

(APIServer pid=39) INFO 04-13 11:07:19 [utils.py:299] 
(APIServer pid=39) INFO 04-13 11:07:19 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=39) INFO 04-13 11:07:19 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.18.2rc1.dev74+g71a9125c6.d20260403
(APIServer pid=39) INFO 04-13 11:07:19 [utils.py:299]   █▄█▀ █     █     █     █  model   cyankiwi/MiniMax-M2.7-AWQ-4bit
(APIServer pid=39) INFO 04-13 11:07:19 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=39) INFO 04-13 11:07:19 [utils.py:299] 
(APIServer pid=39) INFO 04-13 11:07:19 [utils.py:233] non-default args: {'model_tag': 'cyankiwi/MiniMax-M2.7-AWQ-4bit', 'enable_auto_tool_choice': True, 'tool_call_parser': 'minimax_m2', 'host': '0.0.0.0', 'model': 'cyankiwi/MiniMax-M2.7-AWQ-4bit', 'trust_remote_code': True, 'load_format': 'fastsafetensors', 'reasoning_parser': 'minimax_m2', 'master_addr': '192.168.177.11', 'nnodes': 2, 'tensor_parallel_size': 2, 'kv_cache_dtype': 'fp8_e4m3'}
(APIServer pid=39) WARNING 04-13 11:07:19 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_BASE_DIR
(APIServer pid=39) INFO 04-13 11:07:21 [model.py:549] Resolved architecture: MiniMaxM2ForCausalLM
(APIServer pid=39) INFO 04-13 11:07:21 [model.py:1680] Using max model len 196608
(APIServer pid=39) INFO 04-13 11:07:22 [cache.py:253] Using fp8_e4m3 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=39) INFO 04-13 11:07:22 [arg_utils.py:1724] Inferred data_parallel_rank 0 from node_rank 0
(APIServer pid=39) INFO 04-13 11:07:22 [vllm.py:799] Asynchronous scheduling is enabled.
(APIServer pid=39) INFO 04-13 11:07:22 [kernel.py:196] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py:1984: FutureWarning: torch._dynamo.allow_in_graph is deprecated and will be removed in a future version. Use torch._dynamo.nonstrict_trace instead.
  @torch._dynamo.allow_in_graph
(EngineCore pid=92) INFO 04-13 11:07:26 [core.py:105] Initializing a V1 LLM engine (v0.18.2rc1.dev74+g71a9125c6.d20260403) with config: model='cyankiwi/MiniMax-M2.7-AWQ-4bit', speculative_config=None, tokenizer='cyankiwi/MiniMax-M2.7-AWQ-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=196608, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='minimax_m2', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=cyankiwi/MiniMax-M2.7-AWQ-4bit, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')

Keyper-AI · April 13, 2026, 1:59pm

So if I am understanding the benchmarks right, it’s a little bit smarter than Qwen 3.5 397B, but also a little bit slower? You also lose out on image capabilities with Minimax 2.7.

I’ve been super impressed with Qwen 3.5 and it’s been my daily driver. My only complaint is speed.

serapis · April 13, 2026, 2:31pm

It’s far more reliable for tool calling – it also seems to be more consistent for agentic coding (I tested it with OpenClaw, Hermes, Pi, OpenCode). It doesn’t do vision and has less context.

Ultimately you may want to test both and see which works better for you.

carlos.albarran.mx · April 13, 2026, 2:59pm

Any chances that his will run on a single Spark with REAP like:

Topic		Replies	Views
MiniMax M2.5 released (not available on HuggingFace as of now) -- is DGX Spark ready? DGX Spark / GB10	92	6364	April 12, 2026
MiniMax 2.5 REAP - NVFP4 on single DGX Spark DGX Spark / GB10	25	3047	April 1, 2026
Can someone with 2 Sparks benchmark NVFP4 MiniMax M2.1 quant? DGX Spark / GB10	25	1487	January 29, 2026
Help: Running NVFP4 model on 2x DGX Spark with vLLM + Ray (multi-node) DGX Spark / GB10 mistral-large	18	2504	December 25, 2025
RedHatAI/Qwen3.5-122B-A10B-NVFP4 seems to be the best option for a single Spark DGX Spark / GB10 Projects llm	75	5862	May 4, 2026
Two-Spark cluster with vLLM using tensor-parallel-size 2 causes one node to drop while the other's GPU goes 100% forever DGX Spark / GB10	36	1521	February 13, 2026
We unlocked NVFP4 on the DGX Spark: 20% faster than AWQ! DGX Spark / GB10	145	8047	March 28, 2026
MiniMax-2.5 on DGX Spark (thanks to Unsloth https://unsloth.ai/docs/models/minimax-2.5) DGX Spark / GB10 llama	12	3821	February 20, 2026
PSA: State of FP4/NVFP4 Support for DGX Spark in VLLM DGX Spark / GB10	234	12144	May 15, 2026
MiMo-V2.5-NVFP4 on 2x Spark Cluster - Recipe, findings, fixes, benchmarks DGX Spark / GB10	32	1725	May 29, 2026

MiniMax M2.7 NFVP4 Recipe & Benchmarks

Related topics