MiniMax M2.5 released (not available on HuggingFace as of now) -- is DGX Spark ready?

Numbers for cyankiwi AWQ quant:

All rows are for `cyankiwi/MiniMax-M2.5-AWQ-4bit`; pp rows report the latency columns, tg rows report peak t/s.

| test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- |
| pp2048 | 2378.68 ± 329.36 | | 885.51 ± 134.44 | 879.45 ± 134.44 | 885.70 ± 134.40 |
| tg128 | 36.74 ± 0.16 | 38.00 ± 0.00 | | | |
| ctx_pp @ d4096 | 2776.22 ± 24.92 | | 1481.57 ± 13.17 | 1475.51 ± 13.17 | 1481.74 ± 13.21 |
| ctx_tg @ d4096 | 35.39 ± 0.02 | 37.00 ± 0.00 | | | |
| pp2048 @ d4096 | 2032.33 ± 95.75 | | 1015.98 ± 46.81 | 1009.91 ± 46.81 | 1016.12 ± 46.89 |
| tg128 @ d4096 | 34.88 ± 0.58 | 37.00 ± 1.41 | | | |
| ctx_pp @ d8192 | 1845.25 ± 546.49 | | 4792.78 ± 1172.43 | 4786.72 ± 1172.43 | 4792.88 ± 1172.42 |
| ctx_tg @ d8192 | 27.05 ± 7.48 | 28.33 ± 7.32 | | | |
| pp2048 @ d8192 | 1796.84 ± 201.95 | | 1161.65 ± 140.68 | 1155.59 ± 140.68 | 1161.79 ± 140.70 |
| tg128 @ d8192 | 27.95 ± 4.57 | 32.00 ± 0.00 | | | |
| ctx_pp @ d16384 | 2382.49 ± 5.86 | | 6882.94 ± 16.90 | 6876.87 ± 16.90 | 6883.07 ± 16.88 |
| ctx_tg @ d16384 | 28.15 ± 0.15 | 29.33 ± 0.47 | | | |
| pp2048 @ d16384 | 1610.59 ± 6.17 | | 1277.67 ± 4.88 | 1271.60 ± 4.88 | 1277.79 ± 4.86 |
| tg128 @ d16384 | 27.47 ± 0.05 | 28.00 ± 0.00 | | | |
| ctx_pp @ d32068 | 1964.46 ± 1.91 | | 16330.14 ± 15.88 | 16324.07 ± 15.88 | 16330.23 ± 15.88 |
| ctx_tg @ d32068 | 22.64 ± 0.03 | 23.33 ± 0.47 | | | |
| pp2048 @ d32068 | 1191.62 ± 7.80 | | 1724.81 ± 11.30 | 1718.74 ± 11.30 | 1724.91 ± 11.29 |
| tg128 @ d32068 | 20.39 ± 2.64 | 23.00 ± 0.00 | | | |
| ctx_pp @ d65535 | 1395.77 ± 49.48 | | 47019.26 ± 1708.67 | 47013.20 ± 1708.67 | 47019.52 ± 1708.71 |
| ctx_tg @ d65535 | 15.95 ± 0.02 | 17.00 ± 0.00 | | | |
| pp2048 @ d65535 | 768.62 ± 4.64 | | 2670.66 ± 16.04 | 2664.60 ± 16.04 | 2670.77 ± 16.01 |
| tg128 @ d65535 | 15.70 ± 0.03 | 17.00 ± 0.00 | | | |
| ctx_pp @ d100000 | 1086.12 ± 32.25 | | 92159.40 ± 2791.20 | 92153.33 ± 2791.20 | 92159.76 ± 2791.09 |
| ctx_tg @ d100000 | 12.33 ± 0.01 | 13.33 ± 0.47 | | | |
| pp2048 @ d100000 | 562.80 ± 5.01 | | 3645.28 ± 32.60 | 3639.22 ± 32.60 | 3645.39 ± 32.60 |
| tg128 @ d100000 | 11.55 ± 0.88 | 13.00 ± 0.00 | | | |

llama-benchy (0.3.1)
date: 2026-02-16 10:56:20 | latency mode: api
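A quick way to read these numbers: at long context depths, end-to-end TTFT is dominated by prompt processing, so it can be sanity-checked from the average pp throughput alone. A minimal sketch (the function name is my own, not part of llama-benchy):

```python
# Sanity check: at long context, e2e_ttft is roughly depth / pp_speed,
# since prompt processing dominates time-to-first-token there.
def est_ttft_ms(depth_tokens: float, pp_tokens_per_s: float) -> float:
    return depth_tokens / pp_tokens_per_s * 1000.0

# ctx_pp @ d100000 above: 1086.12 t/s average, measured e2e_ttft ~92160 ms.
print(round(est_ttft_ms(100_000, 1086.12)))  # ~92071, within 0.1% of measured
```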


Added a recipe for this model to the repo: minimax-m2.5-awq


Hi!

I’m new to working with the DGX Spark and vLLM, and it seems y’all are doing some great work on supporting SOTA models. I managed to get unsloth’s MiniMax-2.5-GGUF with their so-called “4-bit dynamic” Q3_K_XL quant running with llama.cpp’s CLI on a single DGX Spark. Throughput was good at around 20 t/s, and it gave good answers to difficult coding and logic questions.

I was wondering if there is a particular reason to opt for the minimax-m2.5-awq quant over the Q3_K_XL option (other than literally just less quantization)? If not, for us single-DGX users, would it be feasible to write a recipe for this version of M2.5?

For a single Spark, Q3_K_XL or lower is the only way to go, and that’s with llama.cpp, as vLLM won’t run it.
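Rough arithmetic on why: quantized weight size scales with effective bits per weight, and at 128 GB a single Spark only clears the lower-bpw GGUF quants. A back-of-the-envelope sketch; the parameter count and bpw figures below are illustrative approximations, not official numbers for this model:

```python
# Back-of-the-envelope quantized weight size. bits_per_weight is the
# *effective* rate (quant scales/zeros included), so these are estimates only.
def quant_size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * bits_per_weight / 8

# Illustrative placeholder numbers, not MiniMax M2.5's actual config:
print(quant_size_gb(230, 4.25))  # ~122 GB: 4-bit AWQ-class, needs two Sparks
print(quant_size_gb(230, 3.25))  # ~93 GB: Q3/IQ3-class, fits one 128 GB Spark
```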


@eugr I did a `git pull` on your repo and tried to rebuild with the latest changes, but got this error from `./build-and-copy.sh`:

```
767.9 Traceback (most recent call last):
767.9   File "", line 11, in
767.9   File "/workspace/flashinfer/flashinfer-cubin/build_backend.py", line 100, in build_wheel
767.9     _download_cubins()
767.9   File "/workspace/flashinfer/flashinfer-cubin/build_backend.py", line 33, in _download_cubins
767.9     download_artifacts()
767.9   File "/workspace/flashinfer/flashinfer/artifacts.py", line 223, in download_artifacts
767.9     raise RuntimeError("Failed to download cubins")
767.9 RuntimeError: Failed to download cubins
767.9 × Failed to build /workspace/flashinfer/flashinfer-cubin
767.9 ├─▶ The build backend returned an error
767.9 ╰─▶ Call to build_backend.build_wheel failed (exit status: 1)
767.9 hint: This usually indicates a problem with the package or the build
767.9 environment.
767.9 DEBUG Released lock at /root/.cache/uv/.lock

ERROR: failed to build: failed to solve: process "/bin/sh -c cd flashinfer-cubin && uv build --no-build-isolation --wheel . --out-dir=/workspace/wheels -v" did not complete successfully: exit code: 2
```

Where are you located? A few people have reported issues downloading cubins from NVIDIA servers outside of the US. In that case, repeated attempts to build will eventually get you there, as I have implemented caching of downloaded cubins (something that FlashInfer doesn’t do by default, for some reason).

See this post for details: Does nvidia.com rate-limit legitimate requests associated with building flashinfer from source?

I’m in the process of separating flashinfer build and main docker build, so you will be able to just reuse nightly or stable wheels that I will generate and put in the repo.
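The retry-plus-cache approach can be sketched like this (my own illustrative sketch, not the repo’s actual code; the function name and paths are made up):

```python
# Illustrative sketch: retry a flaky artifact download with exponential
# backoff, keeping a local cache so repeated builds reuse already-fetched
# files instead of hitting the server again.
import time
import urllib.request
from pathlib import Path

def fetch_cached(url: str, cache_dir: Path, retries: int = 3) -> bytes:
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / url.rsplit("/", 1)[-1]
    if target.exists():                # cache hit: skip the network entirely
        return target.read_bytes()
    for attempt in range(retries):
        try:
            data = urllib.request.urlopen(url, timeout=60).read()
            target.write_bytes(data)   # persist for the next build attempt
            return data
        except OSError:
            time.sleep(2 ** attempt)   # back off before retrying
    raise RuntimeError(f"Failed to download {url}")
```

With a cache like this, each failed build still makes forward progress, which is why repeated runs eventually succeed.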

Ahh ok. I’m in Germany. I could try again or vpn and see if that works. Thank you.

Edit: worked on the 3rd run!

Again great work, thank you!!

Btw, published the benchmarks for the spark-vllm-docker for Minimax M2.5 on 2 sparks yesterday


Has anyone pruned it (REAP) by 10%?

Got the 3-bit GGUF quantization from unsloth running on a Jetson Thor. It requires an actual build of llama.cpp (I found no way to get the GGUF running with vLLM).

```
git clone https://github.com/ggml-org/llama.cpp.git ~/llama-cpp/llama.cpp
cd ~/llama-cpp/llama.cpp
export PATH=/usr/local/cuda/bin:$PATH
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=110
cmake --build build --config Release -j$(nproc)
```

CUDA architecture 110 corresponds to the Jetson Thor’s Blackwell GPU (compute capability 11.0); I guessed it would be the same for the DGX Spark. EDIT: it’s 12.1 on Spark, so use 121.

If you use the IQ3_XXS variant of the model, it requires around 93 GB of RAM and leaves enough space to use the full context of 196K tokens (in that case, around 118 GB of RAM is used).

On the Jetson Thor, I get around 22 t/s, so I would expect around 25-28 t/s on a DGX Spark.

I successfully added it to VS Code Insiders using its agent model manager for generic models with an OpenAI-compatible API. Tool use works fine, and the generated results are quite good so far. Not as close to Opus 4.6 as claimed by their model card, but definitely usable.

Pretty amazing that this is possible now on such a small machine!

DGX Spark is sm121 (12.1).


I downloaded the QuantTrio/MiniMax-M2.5-AWQ model today and am running it on my stacked Sparks using @eugr 's latest repository code. It’s working quite well and I am impressed with the quality of the coding, particularly in Xcode as well as VSCode. It may be my new go-to coder. Once things are warmed up it responds very quickly in Xcode.

I ran it thus:

```
./launch-cluster.sh -n 169.254.246.76,169.254.28.198 -d exec vllm serve QuantTrio/MiniMax-M2.5-AWQ --port 8000 --host 0.0.0.0 --gpu-memory-utilization 0.7 -tp 2 --distributed-executor-backend ray --max-model-len 128000 --load-format fastsafetensors --enable-auto-tool-choice --tool-call-parser minimax_m2 --reasoning-parser minimax_m2_append_think --trust-remote-code
```

I wonder if anyone can comment on the significance of the warnings I get when running the model in vLLM:

```
Unknown vLLM environment variable detected: VLLM_BASE_DIR
Async scheduling will be disabled because it is not supported with the `ray` distributed executor backend (only `mp`, `uni`, and `external_launcher` are supported).
Tensor parallel size (2) exceeds available GPUs (1). This may result in Ray placement group allocation failures. Consider reducing tensor_parallel_size to 1 or less, or ensure your Ray cluster has 2 GPUs available.
tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node 291f13266a7e3b028a49f93727aa7e18968a2c1a877e18477f653d6c. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.
tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node c9b1653d1318944a2df4915722efe04c0d07d8159b0da604026fb2b3. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.
Overwriting environment variable LD_LIBRARY_PATH from '/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64' to '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64'
Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.
SymmMemCommunicator: Device capability 12.1 not supported, communicator is not available.
Custom allreduce is disabled because this process group spans across nodes.
Overwriting environment variable LD_LIBRARY_PATH from '/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64' to '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64'
Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.
SymmMemCommunicator: Device capability 12.1 not supported, communicator is not available.
Custom allreduce is disabled because this process group spans across nodes.
Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=768,device_name=NVIDIA_GB10.json
Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=768,device_name=NVIDIA_GB10.json
Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'top_k': 40, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
```

Obviously most are trivial, which is nice, and there are some duplicates because two Sparks.

I’m also puzzled that it won’t work without `--trust-remote-code`, and yet:

```
(APIServer pid=1251) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
```

My undying thanks to @eugr and the crew.

The warnings are normal and can be safely ignored here. Basically, they warn that tensor parallel will not perform well unless you have a fast interconnect (which you have). The others are either related to cluster config (allreduce, etc.) or model config (sampling params) and can also be ignored.


Title: GB10 (Project DIGITS) — OOM during sglang model loading after driver upgrade 580→590

Body:

Hardware: 2x NVIDIA GB10 (Project DIGITS), 128GB unified memory each (119GB usable)

Driver: 590.48.01 (was working on 580.126.09)

Kernel: 6.14.0-1015-nvidia (Ubuntu, aarch64)

Workload: Distributed inference with sglang 0.5.7 (nvcr.io/nvidia/sglang:26.01-py3), serving MiniMax-M2.5-FP8-INT4-AWQ (456B MoE, ~123GB quantized on disk) across both nodes using TP=2, EP=2, nnodes=2.

Problem:

After upgrading both machines from driver 580.126.09 to 590.48.01, sglang consistently OOMs during model loading. The model was loading and serving fine on driver 580 with the exact same container, model, and launch parameters.

The model uses CompressedTensorsWNA16MarlinMoEMethod, which loads pre-cached Marlin-repacked weights from disk (~100GB per node). During the Marlin cache loading phase, memory climbs to 119/119GB and the system becomes unresponsive, eventually triggering OOM kills or a full system hang requiring a hard reboot.

On driver 580, the same loading process completed with roughly 5-7GB of headroom. Driver 590 appears to consume more memory on GB10 unified memory, eliminating that margin.

What I’ve tried:

  • Adding --cpu-offload-gb 30-50 to move weights to swappable CPU memory

  • Adding 280-513GB of SSD swap per node with vm.swappiness=90

  • Setting --max-total-tokens 4096 to minimize KV cache

  • Disabling cuda graph, custom all-reduce

  • Setting PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.5,expandable_segments:True,max_split_size_mb:512

With these workarounds, one node (50GB offload) completed loading with 1.06GB free, but the other node locked up from swap thrashing (90% iowait) and became unresponsive after ~53 minutes.

Steps to reproduce:

1. GB10 with driver 590.48.01, kernel 6.14.0-1015-nvidia

2. Launch sglang with a model that uses ~110GB of Marlin-repacked weights per node (out of 119GB available)

3. Model loading OOMs or system hangs — same config works on driver 580.126.09

Questions:

1. Did the driver 590 memory management (CDMM or otherwise) change how unified memory is allocated on GB10?

2. Is there a known increase in driver memory overhead on GB10 between 580 and 590?

3. Is there a recommended driver version for GB10 inference workloads?
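Not an answer to the driver questions, but one way to quantify the 580-vs-590 overhead difference before launching sglang: on unified-memory boards like GB10, GPU allocations come out of the same pool the kernel tracks, so `MemAvailable` from `/proc/meminfo` is a reasonable proxy for loading headroom. A small sketch of my own (not an sglang utility):

```python
# On unified-memory systems (GB10, Jetson), GPU allocations draw from the
# same pool /proc/meminfo tracks, so MemAvailable approximates how much the
# model loader can claim before the OOM killer steps in.
def mem_available_gb(meminfo_path: str = "/proc/meminfo") -> float:
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / 1024**2  # value is in kB
    raise RuntimeError("MemAvailable not found in " + meminfo_path)
```

Comparing this value right after boot on each driver, before any model is loaded, would show the driver's baseline footprint directly.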

In my test, 2 DGX Sparks at 128K context reach 22 tokens/s.

Don’t use that 590 driver; we all know it has a memory leak problem.

Does anyone have a good recipe that’s stable on dual Sparks for MiniMax with full context? I’ve been using Eugr’s recipe (which is fantastic and very stable), but it only has 120K context and I’d really like to get to that 200K mark (196K, technically, I believe is the max context window). I’m not even sure how much more I can squeeze out of the dual Sparks, as they sit at ~100 GB of consumed memory apiece with just the 120K context window from the recipe.

```
./launch-cluster.sh -t vllm-node -d start

docker exec vllm_node bash -i -c "vllm serve /models/cyankiwi-MiniMax-M2.5-AWQ-4bit \
    --trust-remote-code \
    --port 8000 \
    --host 0.0.0.0 \
    --gpu-memory-utilization 0.7 \
    -tp 2 \
    --distributed-executor-backend ray \
    --served-model-name minimax-m2.5-awq \
    --max-model-len 128000 \
    --load-format fastsafetensors \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2 \
    > /proc/1/fd/1 2> /proc/1/fd/2"
```

I guess I can increase `gpu-memory-utilization`, but is it safe/stable to have it at 0.85 or 0.9? Asking because I keep vLLM running all the time and never bring the container down unless I want to switch models.
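For sizing the jump from 120K to 196K, the KV-cache cost per token can be estimated from the model config. All the architecture numbers below are placeholders to show the arithmetic, not MiniMax M2.5’s actual config; substitute the real values from the model’s config.json:

```python
# KV cache bytes/token = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes.
# The architecture numbers used below are placeholders, not the real config.
def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                head_dim: int, dtype_bytes: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens / 1024**3

# Hypothetical 62-layer, 8-KV-head, 128-dim model at fp16:
extra = kv_cache_gb(196_000, 62, 8, 128) - kv_cache_gb(120_000, 62, 8, 128)
print(round(extra, 1))  # ~18 GB more for the wider window, split across TP=2
```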

In my experience with a headless single unit, I can go rock solid to 0.93 at least.