Results on 2x Asus Ascent GX10: performance is very similar to M2.5, which makes sense since it is basically the same model at the same size.
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| cyankiwi/MiniMax-M2.7-AWQ-4bit | pp2048 | 3121.55 ± 32.45 | | 779.28 ± 6.82 | 656.16 ± 6.82 | 779.35 ± 6.82 |
| cyankiwi/MiniMax-M2.7-AWQ-4bit | tg32 | 41.60 ± 0.06 | 42.94 ± 0.07 | | | |
| cyankiwi/MiniMax-M2.7-AWQ-4bit | pp2048 @ d4096 | 2642.58 ± 6.81 | | 2448.14 ± 5.98 | 2325.02 ± 5.98 | 2448.21 ± 5.98 |
| cyankiwi/MiniMax-M2.7-AWQ-4bit | tg32 @ d4096 | 39.73 ± 0.04 | 41.02 ± 0.04 | | | |
| cyankiwi/MiniMax-M2.7-AWQ-4bit | pp2048 @ d8192 | 2456.91 ± 3.91 | | 4290.97 ± 6.63 | 4167.85 ± 6.63 | 4291.04 ± 6.63 |
| cyankiwi/MiniMax-M2.7-AWQ-4bit | tg32 @ d8192 | 38.56 ± 0.06 | 39.81 ± 0.06 | | | |
| cyankiwi/MiniMax-M2.7-AWQ-4bit | pp2048 @ d16384 | 2196.05 ± 1.09 | | 8516.37 ± 4.16 | 8393.25 ± 4.16 | 8516.44 ± 4.16 |
| cyankiwi/MiniMax-M2.7-AWQ-4bit | tg32 @ d16384 | 35.67 ± 0.04 | 36.83 ± 0.04 | | | |
| cyankiwi/MiniMax-M2.7-AWQ-4bit | pp2048 @ d32768 | 1815.85 ± 2.53 | | 19296.54 ± 26.75 | 19173.42 ± 26.75 | 19296.61 ± 26.74 |
| cyankiwi/MiniMax-M2.7-AWQ-4bit | tg32 @ d32768 | 31.35 ± 0.17 | 32.36 ± 0.17 | | | |
| cyankiwi/MiniMax-M2.7-AWQ-4bit | pp2048 @ d100000 | 1047.93 ± 1.09 | | 97504.06 ± 101.52 | 97380.94 ± 101.52 | 97504.14 ± 101.53 |
| cyankiwi/MiniMax-M2.7-AWQ-4bit | tg32 @ d100000 | 21.20 ± 0.05 | 22.00 ± 0.00 | | | |
llama-benchy (0.3.5)
date: 2026-04-13 14:54:14 | latency mode: generation
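For anyone trying to read the columns: the sketch below checks two relationships that the numbers above appear to satisfy. This is inferred only from the table, not from llama-benchy's documentation, so treat it as an assumption: the pp `t/s` value looks like (depth + 2048) tokens divided by `est_ppt`, and `ttfr` minus `est_ppt` is the same fixed offset in every row. The names `PP_ROWS`, `implied_tps` and `fixed_overhead_ms` are made up for this example.

```python
# Mean values copied from the pp2048 rows of the table above:
# (context depth in tokens, reported t/s, ttfr in ms, est_ppt in ms)
PP_ROWS = [
    (0, 3121.55, 779.28, 656.16),
    (4096, 2642.58, 2448.14, 2325.02),
    (8192, 2456.91, 4290.97, 4167.85),
    (16384, 2196.05, 8516.37, 8393.25),
    (32768, 1815.85, 19296.54, 19173.42),
    (100000, 1047.93, 97504.06, 97380.94),
]

PROMPT_TOKENS = 2048  # the "pp2048" part of the test name


def implied_tps(depth: int, est_ppt_ms: float) -> float:
    """Tokens per second if the whole context (depth + prompt) is
    prefilled in est_ppt_ms. Assumption inferred from the table."""
    return (depth + PROMPT_TOKENS) / (est_ppt_ms / 1000.0)


def fixed_overhead_ms(ttfr_ms: float, est_ppt_ms: float) -> float:
    """Difference between time to first response and estimated prefill time."""
    return ttfr_ms - est_ppt_ms


if __name__ == "__main__":
    for depth, reported, ttfr, ppt in PP_ROWS:
        print(
            f"d{depth:<7} reported={reported:8.2f} t/s  "
            f"implied={implied_tps(depth, ppt):8.2f} t/s  "
            f"ttfr-est_ppt={fixed_overhead_ms(ttfr, ppt):6.2f} ms"
        )
```

If that reading is right, the pp figure at depth is the average prefill rate over the full context, not just the last 2048 tokens, which is worth keeping in mind when comparing against other tools.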
To make it work I just changed 2.5 to 2.7 in the recipe. Here is my version for max context (no `--max-model-len`, so vLLM picks the model's full 196608-token window, see the log further down):
spark-vllm-docker/recipes/minimax-m2.7-awq.yaml
# Recipe: MiniMax-M2.7-AWQ
# MiniMax M2.7 model with AWQ quantization
recipe_version: "1"
name: MiniMax-M2.7-AWQ
description: vLLM serving MiniMax-M2.7-AWQ with Ray distributed backend
# HuggingFace model to download (optional, for --download-model)
model: cyankiwi/MiniMax-M2.7-AWQ-4bit
# Container image to use
container: vllm-node
# Can only be run in a cluster
cluster_only: true
# No mods required
mods: []
# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.9
# Environment variables
env: {}
# The vLLM serve command template
command: |
  vllm serve cyankiwi/MiniMax-M2.7-AWQ-4bit \
    --trust-remote-code \
    --port {port} \
    --host {host} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    -tp {tensor_parallel} \
    --distributed-executor-backend ray \
    --load-format fastsafetensors \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2 \
    --kv-cache-dtype fp8_e4m3
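To show what the `{port}`, `{host}` etc. placeholders turn into, here is a rough sketch of filling the command template from the `defaults` block, with an optional override. This is not the actual spark-vllm-docker code (I have not checked how it does the substitution); `render_command`, `DEFAULTS` and `COMMAND_ARGS` are hypothetical names, and plain `str.format` is just an assumption that happens to match the placeholder syntax.

```python
# Values copied from the `defaults:` section of the recipe above.
DEFAULTS = {
    "port": 8000,
    "host": "0.0.0.0",
    "tensor_parallel": 2,
    "gpu_memory_utilization": 0.9,
}

# The recipe's command, one argument group per entry (joined on one line here).
COMMAND_ARGS = [
    "vllm serve cyankiwi/MiniMax-M2.7-AWQ-4bit",
    "--trust-remote-code",
    "--port {port}",
    "--host {host}",
    "--gpu-memory-utilization {gpu_memory_utilization}",
    "-tp {tensor_parallel}",
    "--distributed-executor-backend ray",
    "--load-format fastsafetensors",
    "--enable-auto-tool-choice",
    "--tool-call-parser minimax_m2",
    "--reasoning-parser minimax_m2",
    "--kv-cache-dtype fp8_e4m3",
]


def render_command(overrides=None):
    """Fill the placeholders with the defaults, letting overrides win.
    Hypothetical helper, only meant to illustrate the substitution."""
    values = {**DEFAULTS, **(overrides or {})}
    return " ".join(COMMAND_ARGS).format(**values)


if __name__ == "__main__":
    print(render_command())
    # Same command with less memory reserved for vLLM:
    print(render_command({"gpu_memory_utilization": 0.85}))
```

The rendered defaults line up with the `non-default args` line in the log below (`tensor_parallel_size: 2`, `kv_cache_dtype: fp8_e4m3`, and so on).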
(APIServer pid=39) INFO 04-13 11:07:19 [utils.py:299]
(APIServer pid=39) INFO 04-13 11:07:19 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=39) INFO 04-13 11:07:19 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.18.2rc1.dev74+g71a9125c6.d20260403
(APIServer pid=39) INFO 04-13 11:07:19 [utils.py:299] █▄█▀ █ █ █ █ model cyankiwi/MiniMax-M2.7-AWQ-4bit
(APIServer pid=39) INFO 04-13 11:07:19 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=39) INFO 04-13 11:07:19 [utils.py:299]
(APIServer pid=39) INFO 04-13 11:07:19 [utils.py:233] non-default args: {'model_tag': 'cyankiwi/MiniMax-M2.7-AWQ-4bit', 'enable_auto_tool_choice': True, 'tool_call_parser': 'minimax_m2', 'host': '0.0.0.0', 'model': 'cyankiwi/MiniMax-M2.7-AWQ-4bit', 'trust_remote_code': True, 'load_format': 'fastsafetensors', 'reasoning_parser': 'minimax_m2', 'master_addr': '192.168.177.11', 'nnodes': 2, 'tensor_parallel_size': 2, 'kv_cache_dtype': 'fp8_e4m3'}
(APIServer pid=39) WARNING 04-13 11:07:19 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_BASE_DIR
(APIServer pid=39) INFO 04-13 11:07:21 [model.py:549] Resolved architecture: MiniMaxM2ForCausalLM
(APIServer pid=39) INFO 04-13 11:07:21 [model.py:1680] Using max model len 196608
(APIServer pid=39) INFO 04-13 11:07:22 [cache.py:253] Using fp8_e4m3 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=39) INFO 04-13 11:07:22 [arg_utils.py:1724] Inferred data_parallel_rank 0 from node_rank 0
(APIServer pid=39) INFO 04-13 11:07:22 [vllm.py:799] Asynchronous scheduling is enabled.
(APIServer pid=39) INFO 04-13 11:07:22 [kernel.py:196] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py:1984: FutureWarning: torch._dynamo.allow_in_graph is deprecated and will be removed in a future version. Use torch._dynamo.nonstrict_trace instead.
@torch._dynamo.allow_in_graph
(EngineCore pid=92) INFO 04-13 11:07:26 [core.py:105] Initializing a V1 LLM engine (v0.18.2rc1.dev74+g71a9125c6.d20260403) with config: model='cyankiwi/MiniMax-M2.7-AWQ-4bit', speculative_config=None, tokenizer='cyankiwi/MiniMax-M2.7-AWQ-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=196608, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='minimax_m2', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=cyankiwi/MiniMax-M2.7-AWQ-4bit, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 
'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')