MiniMax M2.7 NFVP4 Recipe & Benchmarks

I had a chance to play with lukealonso/MiniMax-M2.7-NVFP4 and wanted to share my very first results (Dual Node Setup - 2x Asus Ascent GX10). I let vLLM calculate context and was able to get 196608 – not too terrible.

Benchmarks:

| model        |            test |              t/s |     peak t/s |         ttfr (ms) |      est_ppt (ms) |     e2e_ttft (ms) |
|:-------------|----------------:|-----------------:|-------------:|------------------:|------------------:|------------------:|
| MiniMax-M2.7 |          pp2048 | 2074.25 ± 223.23 |              |   865.36 ± 108.23 |   863.94 ± 108.23 |   865.42 ± 108.22 |
| MiniMax-M2.7 |           tg128 |     24.30 ± 0.02 | 25.00 ± 0.00 |                   |                   |                   |
| MiniMax-M2.7 |  pp2048 @ d4096 | 2249.52 ± 305.90 |              |  2377.95 ± 302.17 |  2376.52 ± 302.17 |  2378.01 ± 302.18 |
| MiniMax-M2.7 |   tg128 @ d4096 |     23.75 ± 0.08 | 24.33 ± 0.47 |                   |                   |                   |
| MiniMax-M2.7 |  pp2048 @ d8192 | 2146.48 ± 526.28 |              | 4515.71 ± 1247.48 | 4514.29 ± 1247.48 | 4515.80 ± 1247.48 |
| MiniMax-M2.7 |   tg128 @ d8192 |     22.92 ± 0.14 | 24.00 ± 0.82 |                   |                   |                   |
| MiniMax-M2.7 | pp2048 @ d16384 |   2471.71 ± 7.59 |              |   6474.06 ± 67.79 |   6472.63 ± 67.79 |   6474.11 ± 67.78 |
| MiniMax-M2.7 |  tg128 @ d16384 |     21.78 ± 0.41 | 23.00 ± 0.00 |                   |                   |                   |

llama-benchy (0.3.5)
date: 2026-04-12 20:43:17 | latency mode: api

My config:

vllm serve lukealonso/MiniMax-M2.7-NVFP4 \
    --host 0.0.0.0 \
    --port 8888 \
    --max-num-seqs 4 \
    --max-num-batched-tokens 8192 \
    --gpu-memory-utilization 0.92 \
    --mamba_ssm_cache_dtype float32 \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --served-model-name MiniMax-M2.7 \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --kv-cache-dtype fp8 \
    --quantization modelopt_fp4 \
    --moe-backend cutlass \
    --disable-custom-all-reduce \
    --dtype auto \
    --trust-remote-code \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray

Environment variables used:

VLLM_USE_FLASHINFER_MOE_FP4=0
VLLM_NVFP4_GEMM_BACKEND=cutlass
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
OMP_NUM_THREADS=8

I run this via @eugr’s spark-vllm-docker with the TF5 container.

I’ll play around with other config options over the coming days but am pretty happy with the result for a first try. Very keen to see what other people achieve, too!

Cheers

Could you please test the llama-bench model with the following parameters:

–pp 1000 --tg 128 --depth 1000 10000 15000 20000 30000 40000 70000 100000 --concurrency 1 2

@joshua.dale.warner (https://forums.developer.nvidia.com/u/joshua.dale.warner)

mrtime

3h

Everyone considering this model needs to very, very carefully look at the modified license MM2.7 was released under.

MM2.7 is NOT open licensed. It is now fundamentally under a non-commercial license. If you make any money with it, or using derivatives of its output, you are opened up to be sued into oblivion.

Claiming this is “open source” is a travesty and wildly dishonest by MiniMax.

Worse, in the repo commit history it had a proper license and then they changed it seemingly mere moments before release.

It would be interesting if the community backwards engineered their claimed improvements from and using MM2.5 - because apparently self-improvement was a huge part of the evolution to MM2.7.

They used what sounds like Autoresearch or a similar harness. We could basically fork the model family from where it was open licensed. I don’t know how it’s controlled, but it takes away any desire to try it. Maybe they’ll release an open-source version, I don’t know.

I’ve been playing with it as well, reposting my post from the 2.5 thread below. Your numbers look better than mine so I’ll update my recipe with your settings. Are you running with or without ray?

Repost below:


The M2.7 release probably needs it’s own thread, but I’ll post this here for now. I’m testing the lukealonso/MiniMax-M2.7-NVFP4 quant on my 2x Asus Gx10 cluster and initial results are promising. I tried first with Cutlass backend:

recipe_version: "1"
name: MiniMax-M2.7-NVFP4-CUTLASS
description: vLLM serving MiniMax-M2.7-NVFP4 using stable CUTLASS NvFP4 backend on GB10/SM121

model: lukealonso/MiniMax-M2.7-NVFP4
container: vllm-node-tf5
cluster_only: false
mods: []

defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.80
  max_model_len: 196000
  max_num_seqs: 5

env:
  VLLM_NVFP4_GEMM_BACKEND: "cutlass"
  VLLM_USE_FLASHINFER_MOE_FP4: 0
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: 1

command: |
  vllm serve lukealonso/MiniMax-M2.7-NVFP4 \
      --trust-remote-code \
      --kv-cache-dtype fp8 \
      --moe-backend cutlass \
      --attention-backend TRITON_ATTN \
      --port {port} \
      --host {host} \
      --gpu-memory-utilization {gpu_memory_utilization} \
      -tp {tensor_parallel} \
      --distributed-executor-backend ray \
      --max-model-len {max_model_len} \
      --max-num-seqs {max_num_seqs} \
      --load-format fastsafetensors \
      --enable-auto-tool-choice \
      --tool-call-parser minimax_m2 \
      --reasoning-parser minimax_m2_append_think

A quick benchmark gave this:

pp=2048, tg=32, depths 0/4096/8192/16384/32768, runs=3, latency-mode=generation

CUTLASS
Latency: 311.2 ms

| Depth | PP tok/s | TG tok/s | E2E TTFT |
| ----- | -------- | -------- | -------- |
| 0     | 2463     | 17.75    | 1.09s    |
| 4096  | 1848     | 11.91    | 3.44s    |
| 8192  | 1601     | 11.71    | 6.32s    |
| 16384 | 1257     | 10.49    | 14.10s   |
| 32768 | 864      | 9.48     | 38.39s 

I then switched the recipe to use flashinfer-cutlass instead:

# Recipe: MiniMax-M2.7-NVFP4 (FlashInfer test)
# Duplicated from the CUTLASS-stable MiniMax M2.7 recipe for A/B testing.
# Keep NVFP4 GEMM on CUTLASS, but switch attention + MoE FP4 path to FlashInfer.
# Also explicitly disable TRT-LLM attention path per Eugr's SM100-only note.

recipe_version: "1"
name: MiniMax-M2.7-NVFP4-FlashInfer-Test
description: Experimental vLLM serving MiniMax-M2.7-NVFP4 with FlashInfer attention and FlashInfer MoE FP4 test path on GB10/SM121

model: lukealonso/MiniMax-M2.7-NVFP4
container: vllm-node-tf5
cluster_only: false
mods: []

defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.80
  max_model_len: 196000
  max_num_seqs: 5

env:
  VLLM_NVFP4_GEMM_BACKEND: "flashinfer-cutlass"
  VLLM_USE_FLASHINFER_MOE_FP4: 1
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: 1

command: |
  vllm serve lukealonso/MiniMax-M2.7-NVFP4 \
      --trust-remote-code \
      --kv-cache-dtype fp8 \
      --attention-backend flashinfer \
      --attention-config.use_trtllm_attention=0 \
      --port {port} \
      --host {host} \
      --gpu-memory-utilization {gpu_memory_utilization} \
      -tp {tensor_parallel} \
      --max-num-seqs {max_num_seqs} \
      --distributed-executor-backend ray \
      --max-model-len {max_model_len} \
      --load-format fastsafetensors \
      --enable-auto-tool-choice \
      --tool-call-parser minimax_m2 \
      --reasoning-parser minimax_m2_append_think

Benchy results were now much better:

FLASHINFER
Latency: 166.7 ms

| Depth | PP tok/s | TG tok/s | E2E TTFT |
| ----- | -------- | -------- | -------- |
| 0     | 2904     | 20.87    | 0.83s    |
| 4096  | 2478     | 20.44    | 2.48s    |
| 8192  | 2295     | 19.99    | 4.38s    |
| 16384 | 2046     | 19.13    | 8.68s    |
| 32768 | 1739     | 17.84    | 19.01s   |

I also ran ToolCall15 which it aced, unlike the 397b which scored 27/30:

| Backend    | Points | Final score | Rating          | Non-pass cases |
| ---------- | ------ | ----------- | --------------- | -------------- |
| CUTLASS    | 30/30  | 100         | ★★★★★ Excellent | 0              |
| FLASHINFER | 30/30  | 100         | ★★★★★ Excellent | 0              |

| Category            | Score |
| ------------------- | ----- |
| Tool Selection      | 100%  |
| Parameter Precision | 100%  |
| Multi-Step Chains   | 100%  |
| Restraint & Refusal | 100%  |
| Error Recovery      | 100%  |

For reference this is what I got on the 397b:

| Category            | Score |
| ------------------- | ----- |
| Tool Selection      | 100%  |
| Parameter Precision | 100%  |
| Multi-Step Chains   | 100%  |
| Restraint & Refusal | 83%   |
| Error Recovery      | 67%   |

Non-pass notes
TC-11: used the calculator unnecessarily for simple math
TC-15: did not preserve the exact searched value across tool calls

Thanks for sharing! I’ll spend more time optimizing things tomorrow. We may still get a little bit more squeezed out of this model. I hope I can get 262k context to work, too.

MiniMax advertises M2.7 as having a context window of 200K. I would be cautious about anything more than that for anything that actually matters.

This is what I’m seeing across 4 Sparks with NVFP4.

(APIServer pid=900) INFO 04-12 23:53:17 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 33.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=900) INFO 04-12 23:53:27 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 33.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=900) INFO 04-12 23:53:37 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 33.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=900) INFO 04-12 23:53:47 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 33.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=900) INFO 04-12 23:53:57 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 33.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%

I see Unsloth posted an FP8. I’m going to give that one a spin now.

And this is what I’m seeing running Unsloth’s FP8 across 4 Sparks.

(APIServer pid=905) INFO 04-13 01:26:11 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=905) INFO 04-13 01:26:21 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%

Kind of surprised. I’m actually seeing not only no degradation in throughput, but an increase in throughput, even though I switched from NVFP4 to the full FP8 model. And when I have a cache hit:

(APIServer pid=905) INFO 04-13 01:33:31 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 53.6 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.9%, Prefix cache hit rate: 44.1%
(APIServer pid=905) INFO 04-13 01:33:41 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 53.8 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.9%, Prefix cache hit rate: 44.1%

FYI, Flashinfer-cutlass (the default NVFP4 backend) seems to be stable enough now that I’m planning to switch all my NVFP4 recipes over to it from VLLM_CUTLASS. The autotuner exceptions are now fixed, and some minor optimizations have been merged into flashinfer recently.

I pulled the latest wheels of spark-vllm-docker yesterday and reranmy benchmarks as well as some additional tests, and flashinfer-cutlass got a nice boost now that autotune runs without errors: earlier I was using 0.19.1rc1.dev71+gdd9342e6b.d20260410, now I’m on 0.19.1rc1.dev219+g72ff142c3.d20260412.

I ran a quick test matrix, forum-cutlass is @serapis settings posted above. Bear in mind my headnode is limited to 2150mhz so it doesn’t freeze.

cell latency ms avg PP avg TG 32k TTFT s read
forum-cutlass, Ray 122.13 2312 22.04 17.55 baseline
forum-cutlass, no-Ray 140.59 2364 22.99 17.50 slower, tiny throughput bump
flashinfer-cutlass + throughput, Ray 111.29 3003 22.98 14.44 best latency
flashinfer-cutlass + throughput, no-Ray 122.52 3065 24.12 14.30 best overall profile
flashinfer-trtllm + latency, Ray 114.49 3033 22.74 14.50 close, not better
flashinfer-trtllm + latency, no-Ray 132.75 3078 23.90 14.43 close, not better

Results for the winner:

depth PP tok/s TG tok/s TTFT s
0 3368 25.81 0.691
4096 3647 25.19 1.699
8192 3174 24.57 3.152
16384 2825 23.50 6.279
32768 2309 21.51 14.304

And full settings for the recipe:

# Recipe: MiniMax-M2.7-NVFP4 (FlashInfer test)
# Duplicated from the CUTLASS-stable MiniMax M2.7 recipe for A/B testing.
# Keep NVFP4 GEMM on CUTLASS, but switch attention + MoE FP4 path to FlashInfer.
# Also explicitly disable TRT-LLM attention path per Eugr's SM100-only note.

recipe_version: "1"
name: MiniMax-M2.7-NVFP4-FlashInfer-Test
description: Experimental vLLM serving MiniMax-M2.7-NVFP4 with FlashInfer attention and FlashInfer MoE FP4 test path on GB10/SM121

model: lukealonso/MiniMax-M2.7-NVFP4
container: vllm-node-tf5
cluster_only: false
mods: []

defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.8057
  max_model_len: 225000
  max_num_seqs: 5

env:
  VLLM_NVFP4_GEMM_BACKEND: flashinfer-cutlass
  VLLM_USE_FLASHINFER_MOE_FP4: 1
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: 1
  OMP_NUM_THREADS: 8
  VLLM_FLOAT32_MATMUL_PRECISION: high
  VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS: 1
  VLLM_FLASHINFER_MOE_BACKEND: throughput

command: |
  vllm serve lukealonso/MiniMax-M2.7-NVFP4 \
      --trust-remote-code \
      --kv-cache-dtype fp8 \
      --dtype auto \
      --quantization modelopt_fp4 \
      --attention-backend flashinfer \
      --max-num-batched-tokens 8192 \
      --mamba_ssm_cache_dtype float32 \
      --disable-custom-all-reduce \
      --port {port} \
      --host {host} \
      --gpu-memory-utilization {gpu_memory_utilization} \
      -tp {tensor_parallel} \
      --max-num-seqs {max_num_seqs} \
      --distributed-executor-backend ray \
      --max-model-len {max_model_len} \
      --load-format fastsafetensors \
      --enable-auto-tool-choice \
      --tool-call-parser minimax_m2 \
      --reasoning-parser minimax_m2_append_think

First impressions in Claude Code are good, it seems much more thorough than 397b and both ran and fixed some tests in my code that 397b had been ignoring ever since it wrote them. Prefix cache hit rate hovering around 95% so responses are very snappy.

int4-AutoRound turns out that this format is inferior to the NVFP4 format?
For the speed + quality + stability ratio, NVFP4 is currently the best option for DGX Spark?

In the context, important updates have been made to the NVFP4 format!

Wild reactions across the net:

From the makers:

More important for us in here I think:

…so as long as you pile up your Sparks at home and use it, you should be fine. 😅

I just tested MiniMax M2.7 on my dual Spark setup using:

  Image: vllm-node (eugr spark-vllm-docker, --rebuild-vllm --tf5)                                                                                                            
  vLLM: 0.19.1rc1.dev221                                                                                                                                                     
  Hardware: 2x DGX Spark, CX7 200Gbps direct connect                                                                                                      
                                                                                                                                                                             
  Environment:                                                                                                                                                               
    NCCL_IB_DISABLE=0                                                                                                                                                        
    NCCL_P2P_DISABLE=1                                      
    VLLM_USE_FLASHINFER_MOE_FP16=1
    VLLM_USE_DEEP_GEMM=0                                                                                                                                                     
    VLLM_USE_FLASHINFER_SAMPLER=0
    OMP_NUM_THREADS=4                                                                                                                                                        
                                                            
  vllm serve cyankiwi/MiniMax-M2.7-AWQ-4bit \                                                                                                                                
      --trust-remote-code \
      --port 8000 \                                                                                                                                                          
      --host 0.0.0.0 \                                      
      --gpu-memory-utilization 0.85 \
      -tp 2 \                                                                                                                                                                
      --distributed-executor-backend ray \
      --max-model-len 196608 \                                                                                                                                               
      --load-format fastsafetensors \                       
      --enable-auto-tool-choice \
      --tool-call-parser minimax_m2 \                                                                                                                                        
      --reasoning-parser minimax_m2 \
      --attention-backend FLASHINFER    

Here are the Llama Benchy results:

  MiniMax-M2.7 AWQ 4-bit on 2x DGX Spark (TP=2, CX7 200Gbps)   
  vLLM 0.19.1rc1.dev221 (eugr spark-vllm-docker, --rebuild-vllm --tf5)                                                                                                       
  gpu-memory-utilization 0.85, max-model-len 196608, FlashInfer attention, fastsafetensors                                                                                   
                                                                                                                                                                             
  | model                          |             test |             t/s |     peak t/s |          ttfr (ms) |       est_ppt (ms) |      e2e_ttft (ms) |                      
  |:-------------------------------|-----------------:|----------------:|-------------:|-------------------:|-------------------:|-------------------:|                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |           pp2048 |  2900.93 ± 3.91 |              |      707.52 ± 0.95 |      705.98 ± 0.95 |      707.59 ± 0.94 |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |            tg128 |    38.32 ± 0.03 | 39.00 ± 0.00 |                    |                    |                    |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |   pp2048 @ d4096 |  2788.71 ± 1.79 |              |     2204.70 ± 1.41 |     2203.17 ± 1.41 |     2204.78 ± 1.41 |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |    tg128 @ d4096 |    34.89 ± 0.40 | 36.00 ± 0.00 |                    |                    |                    |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |   pp2048 @ d8192 |  2656.71 ± 6.06 |              |     3855.83 ± 8.96 |     3854.29 ± 8.96 |     3855.92 ± 8.94 |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |    tg128 @ d8192 |    32.83 ± 0.16 | 33.67 ± 0.47 |                    |                    |                    |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |  pp2048 @ d16384 | 2389.09 ± 46.57 |              |   7719.58 ± 152.55 |   7718.05 ± 152.55 |   7719.66 ± 152.56 |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |   tg128 @ d16384 |    28.30 ± 0.23 | 29.00 ± 0.00 |                    |                    |                    |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |  pp2048 @ d32768 | 2044.32 ± 13.42 |              |  17032.86 ± 112.31 |  17031.32 ± 112.31 |  17032.93 ± 112.31 |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |   tg128 @ d32768 |    22.68 ± 0.14 | 23.67 ± 0.47 |                    |                    |                    |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |  pp2048 @ d65536 |  1570.44 ± 2.67 |              |   43036.74 ± 73.18 |   43035.20 ± 73.18 |   43036.80 ± 73.18 |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |   tg128 @ d65536 |    16.14 ± 0.06 | 17.67 ± 0.47 |                    |                    |                    |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit | pp2048 @ d131072 |  1064.93 ± 3.69 |              | 125006.15 ± 434.25 | 125004.61 ± 434.25 | 125006.24 ± 434.25 |                      
  | cyankiwi/MiniMax-M2.7-AWQ-4bit |  tg128 @ d131072 |    10.28 ± 0.07 | 12.00 ± 0.00 |                    |                    |                    |                      
                                                                                                                                                                             
  llama-benchy (0.3.5)   

I also ran a small suite of OpenClaw usage tests I created, and it beat MiniMax 2.5, Qwen3.5 122B, Qwen3.5 397B and even Haiku 4.5.

I’m really pleased with it, although I am hoping the int4 Autoround will be a bit faster once it’s released.

So just wanted to say thanks to Eugr and the rest of the community for all their hard work!

This is for the nvfp4 one based on @ekkis with subtle tweaks:

| model        |             test |             t/s |     peak t/s |          ttfr (ms) |       est_ppt (ms) |      e2e_ttft (ms) |
|:-------------|-----------------:|----------------:|-------------:|-------------------:|-------------------:|-------------------:|
| MiniMax-M2.7 |           pp2048 | 3146.24 ± 16.60 |              |      581.68 ± 5.88 |      580.03 ± 5.88 |      581.76 ± 5.87 |
| MiniMax-M2.7 |            tg128 |    25.69 ± 0.08 | 26.00 ± 0.00 |                    |                    |                    |
| MiniMax-M2.7 |   pp2048 @ d4096 |  4136.09 ± 9.54 |              |    1304.96 ± 17.19 |    1303.31 ± 17.19 |    1305.02 ± 17.18 |
| MiniMax-M2.7 |    tg128 @ d4096 |    25.01 ± 0.05 | 26.00 ± 0.00 |                    |                    |                    |
| MiniMax-M2.7 |   pp2048 @ d8192 | 3532.93 ± 30.21 |              |     2571.29 ± 2.47 |     2569.64 ± 2.47 |     2571.36 ± 2.47 |
| MiniMax-M2.7 |    tg128 @ d8192 |    24.48 ± 0.05 | 25.00 ± 0.00 |                    |                    |                    |
| MiniMax-M2.7 |  pp2048 @ d16384 | 3401.70 ± 14.73 |              |    4731.97 ± 75.26 |    4730.32 ± 75.26 |    4732.05 ± 75.26 |
| MiniMax-M2.7 |   tg128 @ d16384 |    23.37 ± 0.08 | 24.00 ± 0.00 |                    |                    |                    |
| MiniMax-M2.7 |  pp2048 @ d32768 | 2638.56 ± 13.09 |              |  11539.58 ± 191.11 |  11537.93 ± 191.11 |  11539.64 ± 191.12 |
| MiniMax-M2.7 |   tg128 @ d32768 |    21.55 ± 0.06 | 22.00 ± 0.00 |                    |                    |                    |
| MiniMax-M2.7 |  pp2048 @ d65536 | 1815.83 ± 14.16 |              |  32460.81 ± 377.12 |  32459.16 ± 377.12 |  32460.87 ± 377.13 |
| MiniMax-M2.7 |   tg128 @ d65536 |    18.64 ± 0.11 | 20.00 ± 0.00 |                    |                    |                    |
| MiniMax-M2.7 | pp2048 @ d131072 |  1117.27 ± 6.17 |              | 104292.66 ± 704.48 | 104291.01 ± 704.48 | 104292.75 ± 704.52 |
| MiniMax-M2.7 |  tg128 @ d131072 |    11.26 ± 2.79 | 13.67 ± 1.70 |                    |                    |                    |

And here is the AWQ one – that seems to be a winner indeed:

| model        |             test |             t/s |     peak t/s |          ttfr (ms) |       est_ppt (ms) |      e2e_ttft (ms) |
|:-------------|-----------------:|----------------:|-------------:|-------------------:|-------------------:|-------------------:|
| MiniMax-M2.7 |           pp2048 | 2898.23 ± 28.92 |              |     625.28 ± 10.79 |     623.73 ± 10.79 |     625.32 ± 10.79 |
| MiniMax-M2.7 |            tg128 |    39.39 ± 0.06 | 40.00 ± 0.00 |                    |                    |                    |
| MiniMax-M2.7 |   pp2048 @ d4096 |  3570.38 ± 7.00 |              |    1494.05 ± 29.83 |    1492.50 ± 29.83 |    1494.11 ± 29.82 |
| MiniMax-M2.7 |    tg128 @ d4096 |    37.95 ± 0.10 | 39.00 ± 0.00 |                    |                    |                    |
| MiniMax-M2.7 |   pp2048 @ d8192 | 3157.62 ± 23.87 |              |    2858.67 ± 14.45 |    2857.12 ± 14.45 |    2858.73 ± 14.45 |
| MiniMax-M2.7 |    tg128 @ d8192 |    37.04 ± 0.30 | 37.67 ± 0.47 |                    |                    |                    |
| MiniMax-M2.7 |  pp2048 @ d16384 |  2995.08 ± 4.62 |              |    5259.94 ± 26.68 |    5258.39 ± 26.68 |    5259.99 ± 26.68 |
| MiniMax-M2.7 |   tg128 @ d16384 |    34.35 ± 0.13 | 35.00 ± 0.00 |                    |                    |                    |
| MiniMax-M2.7 |  pp2048 @ d32768 |  2380.92 ± 4.14 |              |   12728.98 ± 83.76 |   12727.43 ± 83.76 |   12729.05 ± 83.75 |
| MiniMax-M2.7 |   tg128 @ d32768 |    30.58 ± 0.09 | 31.33 ± 0.47 |                    |                    |                    |
| MiniMax-M2.7 |  pp2048 @ d65536 |  1696.77 ± 4.17 |              |  34787.96 ± 205.02 |  34786.41 ± 205.02 |  34788.04 ± 205.02 |
| MiniMax-M2.7 |   tg128 @ d65536 |    25.15 ± 0.14 | 27.00 ± 0.00 |                    |                    |                    |
| MiniMax-M2.7 | pp2048 @ d131072 |  1080.73 ± 2.02 |              | 107408.95 ± 586.57 | 107407.40 ± 586.57 | 107409.01 ± 586.57 |
| MiniMax-M2.7 |  tg128 @ d131072 |    18.47 ± 0.13 | 20.67 ± 0.47 |                    |                    |                    |

llama-benchy (0.3.5)
date: 2026-04-13 11:45:25 | latency mode: api

Recipe (slightly changed in comparison to @mikenott):

vllm serve cyankiwi/MiniMax-M2.7-AWQ-4bit \
    --host 0.0.0.0 \
    --port 8888 \
    --max-model-len 196608 \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 4 \
    --max-num-batched-tokens 8192 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --enable-auto-tool-choice \
    --served-model-name MiniMax-M2.7 \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2 \
    --attention-backend flashinfer \
    --override-generation-config "{\"top_k\": 40, \"top_p\": 0.95, \"temperature\": 1.0, \"min_p\": 0.01}" \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --disable-custom-all-reduce \
    --trust-remote-code \
    --tensor-parallel-size 2 \
    --load-format fastsafetensors \
    --distributed-executor-backend ray

Environment variables:

VLLM_USE_FLASHINFER_MOE_FP16=1
VLLM_USE_DEEP_GEMM=0
VLLM_USE_FLASHINFER_SAMPLER=0
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
VLLM_FLOAT32_MATMUL_PRECISION=high
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
OMP_NUM_THREADS=8

Thanks for the update to the model, I was waiting for it, thanks for the new recipes

2x Asus Ascent GX10, performance very similar to M2.5 (which makes sense, basically same model, same size).

model test t/s peak t/s ttfr (ms) est_ppt (ms) e2e_ttft (ms)
cyankiwi/MiniMax-M2.7-AWQ-4bit pp2048 3121.55 ± 32.45 779.28 ± 6.82 656.16 ± 6.82 779.35 ± 6.82
cyankiwi/MiniMax-M2.7-AWQ-4bit tg32 41.60 ± 0.06 42.94 ± 0.07
cyankiwi/MiniMax-M2.7-AWQ-4bit pp2048 @ d4096 2642.58 ± 6.81 2448.14 ± 5.98 2325.02 ± 5.98 2448.21 ± 5.98
cyankiwi/MiniMax-M2.7-AWQ-4bit tg32 @ d4096 39.73 ± 0.04 41.02 ± 0.04
cyankiwi/MiniMax-M2.7-AWQ-4bit pp2048 @ d8192 2456.91 ± 3.91 4290.97 ± 6.63 4167.85 ± 6.63 4291.04 ± 6.63
cyankiwi/MiniMax-M2.7-AWQ-4bit tg32 @ d8192 38.56 ± 0.06 39.81 ± 0.06
cyankiwi/MiniMax-M2.7-AWQ-4bit pp2048 @ d16384 2196.05 ± 1.09 8516.37 ± 4.16 8393.25 ± 4.16 8516.44 ± 4.16
cyankiwi/MiniMax-M2.7-AWQ-4bit tg32 @ d16384 35.67 ± 0.04 36.83 ± 0.04
cyankiwi/MiniMax-M2.7-AWQ-4bit pp2048 @ d32768 1815.85 ± 2.53 19296.54 ± 26.75 19173.42 ± 26.75 19296.61 ± 26.74
cyankiwi/MiniMax-M2.7-AWQ-4bit tg32 @ d32768 31.35 ± 0.17 32.36 ± 0.17
cyankiwi/MiniMax-M2.7-AWQ-4bit pp2048 @ d100000 1047.93 ± 1.09 97504.06 ± 101.52 97380.94 ± 101.52 97504.14 ± 101.53
cyankiwi/MiniMax-M2.7-AWQ-4bit tg32 @ d100000 21.20 ± 0.05 22.00 ± 0.00

llama-benchy (0.3.5)
date: 2026-04-13 14:54:14 | latency mode: generation

To make it work I just updated the 2.5 to 2.7 in the recipe. Here is my version for max context:

spark-vllm-docker/recipes/minimax-m2.7-awq.yaml

# Recipe: MiniMax-M2.7-AWQ
# MiniMax M2.7 model with AWQ quantization

recipe_version: "1"
name: MiniMax-M2.7-AWQ
description: vLLM serving MiniMax-M2.7-AWQ with Ray distributed backend

# HuggingFace model to download (optional, for --download-model)
model: cyankiwi/MiniMax-M2.7-AWQ-4bit

# Container image to use
container: vllm-node

# Can only be run in a cluster
cluster_only: true

# No mods required
mods: []

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.9

# Environment variables
env: {}

# The vLLM serve command template
command: |
  vllm serve cyankiwi/MiniMax-M2.7-AWQ-4bit \
      --trust-remote-code \
      --port {port} \
      --host {host} \
      --gpu-memory-utilization {gpu_memory_utilization} \
      -tp {tensor_parallel} \
      --distributed-executor-backend ray \
      --load-format fastsafetensors \
      --enable-auto-tool-choice \
      --tool-call-parser minimax_m2 \
      --reasoning-parser minimax_m2 \
      --kv-cache-dtype fp8_e4m3
(APIServer pid=39) INFO 04-13 11:07:19 [utils.py:299] 
(APIServer pid=39) INFO 04-13 11:07:19 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=39) INFO 04-13 11:07:19 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.18.2rc1.dev74+g71a9125c6.d20260403
(APIServer pid=39) INFO 04-13 11:07:19 [utils.py:299]   █▄█▀ █     █     █     █  model   cyankiwi/MiniMax-M2.7-AWQ-4bit
(APIServer pid=39) INFO 04-13 11:07:19 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=39) INFO 04-13 11:07:19 [utils.py:299] 
(APIServer pid=39) INFO 04-13 11:07:19 [utils.py:233] non-default args: {'model_tag': 'cyankiwi/MiniMax-M2.7-AWQ-4bit', 'enable_auto_tool_choice': True, 'tool_call_parser': 'minimax_m2', 'host': '0.0.0.0', 'model': 'cyankiwi/MiniMax-M2.7-AWQ-4bit', 'trust_remote_code': True, 'load_format': 'fastsafetensors', 'reasoning_parser': 'minimax_m2', 'master_addr': '192.168.177.11', 'nnodes': 2, 'tensor_parallel_size': 2, 'kv_cache_dtype': 'fp8_e4m3'}
(APIServer pid=39) WARNING 04-13 11:07:19 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_BASE_DIR
(APIServer pid=39) INFO 04-13 11:07:21 [model.py:549] Resolved architecture: MiniMaxM2ForCausalLM
(APIServer pid=39) INFO 04-13 11:07:21 [model.py:1680] Using max model len 196608
(APIServer pid=39) INFO 04-13 11:07:22 [cache.py:253] Using fp8_e4m3 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=39) INFO 04-13 11:07:22 [arg_utils.py:1724] Inferred data_parallel_rank 0 from node_rank 0
(APIServer pid=39) INFO 04-13 11:07:22 [vllm.py:799] Asynchronous scheduling is enabled.
(APIServer pid=39) INFO 04-13 11:07:22 [kernel.py:196] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py:1984: FutureWarning: torch._dynamo.allow_in_graph is deprecated and will be removed in a future version. Use torch._dynamo.nonstrict_trace instead.
  @torch._dynamo.allow_in_graph
(EngineCore pid=92) INFO 04-13 11:07:26 [core.py:105] Initializing a V1 LLM engine (v0.18.2rc1.dev74+g71a9125c6.d20260403) with config: model='cyankiwi/MiniMax-M2.7-AWQ-4bit', speculative_config=None, tokenizer='cyankiwi/MiniMax-M2.7-AWQ-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=196608, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='minimax_m2', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=cyankiwi/MiniMax-M2.7-AWQ-4bit, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')

So if I am understanding the benchmarks right, it’s a little bit smarter than Qwen 3.5 397B, but also a little bit slower? You also lose out on image capabilities with Minimax 2.7.

I’ve been super impressed with Qwen 3.5 and it’s been my daily driver. My only complaint is speed.

It’s far more reliable for tool calling – it also seems to be more consistent for agentic coding (I tested it with OpenClaw, Hermes, Pi, OpenCode). It doesn’t do vision and has less context.

Ultimately you may want to test both and see which works better for you.

Any chances that his will run on a single Spark with REAP like: