MiniMax M2.5 released (not available on HuggingFace as of now) -- is DGX Spark ready?

Numbers for cyankiwi AWQ quant:

All rows are for `cyankiwi/MiniMax-M2.5-AWQ-4bit`; pp rows report the latency columns, tg rows report peak t/s.

| test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- |
| pp2048 | 2378.68 ± 329.36 | | 885.51 ± 134.44 | 879.45 ± 134.44 | 885.70 ± 134.40 |
| tg128 | 36.74 ± 0.16 | 38.00 ± 0.00 | | | |
| ctx_pp @ d4096 | 2776.22 ± 24.92 | | 1481.57 ± 13.17 | 1475.51 ± 13.17 | 1481.74 ± 13.21 |
| ctx_tg @ d4096 | 35.39 ± 0.02 | 37.00 ± 0.00 | | | |
| pp2048 @ d4096 | 2032.33 ± 95.75 | | 1015.98 ± 46.81 | 1009.91 ± 46.81 | 1016.12 ± 46.89 |
| tg128 @ d4096 | 34.88 ± 0.58 | 37.00 ± 1.41 | | | |
| ctx_pp @ d8192 | 1845.25 ± 546.49 | | 4792.78 ± 1172.43 | 4786.72 ± 1172.43 | 4792.88 ± 1172.42 |
| ctx_tg @ d8192 | 27.05 ± 7.48 | 28.33 ± 7.32 | | | |
| pp2048 @ d8192 | 1796.84 ± 201.95 | | 1161.65 ± 140.68 | 1155.59 ± 140.68 | 1161.79 ± 140.70 |
| tg128 @ d8192 | 27.95 ± 4.57 | 32.00 ± 0.00 | | | |
| ctx_pp @ d16384 | 2382.49 ± 5.86 | | 6882.94 ± 16.90 | 6876.87 ± 16.90 | 6883.07 ± 16.88 |
| ctx_tg @ d16384 | 28.15 ± 0.15 | 29.33 ± 0.47 | | | |
| pp2048 @ d16384 | 1610.59 ± 6.17 | | 1277.67 ± 4.88 | 1271.60 ± 4.88 | 1277.79 ± 4.86 |
| tg128 @ d16384 | 27.47 ± 0.05 | 28.00 ± 0.00 | | | |
| ctx_pp @ d32068 | 1964.46 ± 1.91 | | 16330.14 ± 15.88 | 16324.07 ± 15.88 | 16330.23 ± 15.88 |
| ctx_tg @ d32068 | 22.64 ± 0.03 | 23.33 ± 0.47 | | | |
| pp2048 @ d32068 | 1191.62 ± 7.80 | | 1724.81 ± 11.30 | 1718.74 ± 11.30 | 1724.91 ± 11.29 |
| tg128 @ d32068 | 20.39 ± 2.64 | 23.00 ± 0.00 | | | |
| ctx_pp @ d65535 | 1395.77 ± 49.48 | | 47019.26 ± 1708.67 | 47013.20 ± 1708.67 | 47019.52 ± 1708.71 |
| ctx_tg @ d65535 | 15.95 ± 0.02 | 17.00 ± 0.00 | | | |
| pp2048 @ d65535 | 768.62 ± 4.64 | | 2670.66 ± 16.04 | 2664.60 ± 16.04 | 2670.77 ± 16.01 |
| tg128 @ d65535 | 15.70 ± 0.03 | 17.00 ± 0.00 | | | |
| ctx_pp @ d100000 | 1086.12 ± 32.25 | | 92159.40 ± 2791.20 | 92153.33 ± 2791.20 | 92159.76 ± 2791.09 |
| ctx_tg @ d100000 | 12.33 ± 0.01 | 13.33 ± 0.47 | | | |
| pp2048 @ d100000 | 562.80 ± 5.01 | | 3645.28 ± 32.60 | 3639.22 ± 32.60 | 3645.39 ± 32.60 |
| tg128 @ d100000 | 11.55 ± 0.88 | 13.00 ± 0.00 | | | |

llama-benchy (0.3.1)
date: 2026-02-16 10:56:20 | latency mode: api
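A quick way to read these numbers: at long context depths, end-to-end TTFT is dominated by prompt processing, so it can be sanity-checked from the average pp throughput alone. A minimal sketch (the function name is my own, not part of llama-benchy):

```python
# Sanity check: at long context, e2e_ttft is roughly depth / pp_speed,
# since prompt processing dominates time-to-first-token there.
def est_ttft_ms(depth_tokens: float, pp_tokens_per_s: float) -> float:
    return depth_tokens / pp_tokens_per_s * 1000.0

# ctx_pp @ d100000 above: 1086.12 t/s average, measured e2e_ttft ~92160 ms.
print(round(est_ttft_ms(100_000, 1086.12)))  # ~92071, within 0.1% of measured
```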


Added a recipe for this model to the repo: minimax-m2.5-awq


Hi!

I’m new to working with the DGX Spark and vLLM, and it seems y’all are doing some great work on supporting SOTA models. I managed to get unsloth’s MiniMax-2.5-GGUF with their so-called “4-bit dynamic” Q3_K_XL quant running with llama.cpp’s CLI on a single DGX Spark. Throughput was good at around 20 t/s, and it gave good answers to difficult coding and logic questions.

I was wondering if there is a particular reason to opt for the minimax-m2.5-awq quant over the Q3_K_XL option (other than literally just less quantization)? If not, for us single-DGX users, would it be feasible to write a recipe for this version of M2.5?

For a single Spark, Q3_K_XL or lower is the only way to go, and that’s with llama.cpp, as vLLM won’t run it.
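Rough arithmetic on why: quantized weight size scales with effective bits per weight, and at 128 GB a single Spark only clears the lower-bpw GGUF quants. A back-of-the-envelope sketch; the parameter count and bpw figures below are illustrative approximations, not official numbers for this model:

```python
# Back-of-the-envelope quantized weight size. bits_per_weight is the
# *effective* rate (quant scales/zeros included), so these are estimates only.
def quant_size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * bits_per_weight / 8

# Illustrative placeholder numbers, not MiniMax M2.5's actual config:
print(quant_size_gb(230, 4.25))  # ~122 GB: 4-bit AWQ-class, needs two Sparks
print(quant_size_gb(230, 3.25))  # ~93 GB: Q3/IQ3-class, fits one 128 GB Spark
```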


@eugr I did a `git pull` on your repo and tried to rebuild with the latest changes, but got this error from `./build-and-copy.sh`:

```
767.9 Traceback (most recent call last):
767.9   File "", line 11, in
767.9   File "/workspace/flashinfer/flashinfer-cubin/build_backend.py", line 100, in build_wheel
767.9     _download_cubins()
767.9   File "/workspace/flashinfer/flashinfer-cubin/build_backend.py", line 33, in _download_cubins
767.9     download_artifacts()
767.9   File "/workspace/flashinfer/flashinfer/artifacts.py", line 223, in download_artifacts
767.9     raise RuntimeError("Failed to download cubins")
767.9 RuntimeError: Failed to download cubins
767.9 × Failed to build /workspace/flashinfer/flashinfer-cubin
767.9 ├─▶ The build backend returned an error
767.9 ╰─▶ Call to build_backend.build_wheel failed (exit status: 1)
767.9 hint: This usually indicates a problem with the package or the build
767.9 environment.
767.9 DEBUG Released lock at /root/.cache/uv/.lock

ERROR: failed to build: failed to solve: process "/bin/sh -c cd flashinfer-cubin && uv build --no-build-isolation --wheel . --out-dir=/workspace/wheels -v" did not complete successfully: exit code: 2
```

Where are you located? A few people have reported issues downloading cubins from NVIDIA servers outside of the US. In that case, repeated attempts to build will eventually get you there, as I have implemented caching of downloaded cubins (something that FlashInfer doesn’t do by default, for some reason).

See this post for details: Does nvidia.com rate-limit legitimate requests associated with building flashinfer from source?

I’m in the process of separating flashinfer build and main docker build, so you will be able to just reuse nightly or stable wheels that I will generate and put in the repo.
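The retry-plus-cache approach can be sketched like this (my own illustrative sketch, not the repo’s actual code; the function name and paths are made up):

```python
# Illustrative sketch: retry a flaky artifact download with exponential
# backoff, keeping a local cache so repeated builds reuse already-fetched
# files instead of hitting the server again.
import time
import urllib.request
from pathlib import Path

def fetch_cached(url: str, cache_dir: Path, retries: int = 3) -> bytes:
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / url.rsplit("/", 1)[-1]
    if target.exists():                # cache hit: skip the network entirely
        return target.read_bytes()
    for attempt in range(retries):
        try:
            data = urllib.request.urlopen(url, timeout=60).read()
            target.write_bytes(data)   # persist for the next build attempt
            return data
        except OSError:
            time.sleep(2 ** attempt)   # back off before retrying
    raise RuntimeError(f"Failed to download {url}")
```

With a cache like this, each failed build still makes forward progress, which is why repeated runs eventually succeed.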

Ahh ok. I’m in Germany. I could try again or vpn and see if that works. Thank you.

Edit: worked on the 3rd run!

Again great work, thank you!!

Btw, published the benchmarks for the spark-vllm-docker for Minimax M2.5 on 2 sparks yesterday


Has anyone pruned it (REAP) by 10%?

Got the 3-bit GGUF quantization from unsloth running on a Jetson Thor. It requires an actual build of llama.cpp (I found no way to get the GGUF running with vLLM).

```
git clone https://github.com/ggml-org/llama.cpp.git ~/llama-cpp/llama.cpp
cd ~/llama-cpp/llama.cpp
export PATH=/usr/local/cuda/bin:$PATH
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=110
cmake --build build --config Release -j$(nproc)
```

CUDA architecture 110 corresponds to the Jetson Thor’s Blackwell GPU (compute capability 11.0); I guessed it would be the same for the DGX Spark. EDIT: it’s 12.1 on Spark, so use 121.

If you use the IQ3_XXS variant of the model, it requires around 93 GB of RAM and leaves enough space to use the full context of 196K tokens (in that case, around 118 GB of RAM is used).

On the Jetson Thor, I get around 22 t/s, so I would expect around 25-28 t/s on a DGX Spark.

I successfully added it to VS Code Insiders using its agent model manager for generic models with an OpenAI-compatible API. Tool use works fine, and the generated results are quite good so far. Not as close to Opus 4.6 as claimed by their model card, but definitely usable.

Pretty amazing that this is possible now on such a small machine!

DGX Spark is sm121 (12.1).


I downloaded the QuantTrio/MiniMax-M2.5-AWQ model today and am running it on my stacked Sparks using @eugr 's latest repository code. It’s working quite well and I am impressed with the quality of the coding, particularly in Xcode as well as VSCode. It may be my new go-to coder. Once things are warmed up it responds very quickly in Xcode.

I ran it thus:

```
./launch-cluster.sh -n 169.254.246.76,169.254.28.198 -d exec vllm serve QuantTrio/MiniMax-M2.5-AWQ --port 8000 --host 0.0.0.0 --gpu-memory-utilization 0.7 -tp 2 --distributed-executor-backend ray --max-model-len 128000 --load-format fastsafetensors --enable-auto-tool-choice --tool-call-parser minimax_m2 --reasoning-parser minimax_m2_append_think --trust-remote-code
```

I wonder if anyone can comment on the significance of the warnings I get when running the model in vLLM:

```
Unknown vLLM environment variable detected: VLLM_BASE_DIR
Async scheduling will be disabled because it is not supported with the `ray` distributed executor backend (only `mp`, `uni`, and `external_launcher` are supported).
Tensor parallel size (2) exceeds available GPUs (1). This may result in Ray placement group allocation failures. Consider reducing tensor_parallel_size to 1 or less, or ensure your Ray cluster has 2 GPUs available.
tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node 291f13266a7e3b028a49f93727aa7e18968a2c1a877e18477f653d6c. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.
tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node c9b1653d1318944a2df4915722efe04c0d07d8159b0da604026fb2b3. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.
Overwriting environment variable LD_LIBRARY_PATH from '/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64' to '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64'
Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.
SymmMemCommunicator: Device capability 12.1 not supported, communicator is not available.
Custom allreduce is disabled because this process group spans across nodes.
Overwriting environment variable LD_LIBRARY_PATH from '/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64' to '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64'
Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.
SymmMemCommunicator: Device capability 12.1 not supported, communicator is not available.
Custom allreduce is disabled because this process group spans across nodes.
Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=768,device_name=NVIDIA_GB10.json
Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=768,device_name=NVIDIA_GB10.json
Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'top_k': 40, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
```

Obviously most are trivial, which is nice, and there are some duplicates because two Sparks.

I’m also puzzled that it won’t work without `--trust-remote-code`, and yet:

```
(APIServer pid=1251) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
```

My undying thanks to @eugr and the crew.

The warnings are normal and can be safely ignored here. Basically, they warn that tensor parallel will not perform well unless you have a fast interconnect (which you have). The others are either related to cluster config (allreduce, etc.) or model config (sampling params) and can also be ignored.


Title: GB10 (Project DIGITS) — OOM during sglang model loading after driver upgrade 580→590

Body:

Hardware: 2x NVIDIA GB10 (Project DIGITS), 128GB unified memory each (119GB usable)

Driver: 590.48.01 (was working on 580.126.09)

Kernel: 6.14.0-1015-nvidia (Ubuntu, aarch64)

Workload: Distributed inference with sglang 0.5.7 (nvcr.io/nvidia/sglang:26.01-py3), serving MiniMax-M2.5-FP8-INT4-AWQ (456B MoE, ~123GB quantized on disk) across both nodes using TP=2, EP=2, nnodes=2.

Problem:

After upgrading both machines from driver 580.126.09 to 590.48.01, sglang consistently OOMs during model loading. The model was loading and serving fine on driver 580 with the exact same container, model, and launch parameters.

The model uses CompressedTensorsWNA16MarlinMoEMethod, which loads pre-cached Marlin-repacked weights from disk (~100GB per node). During the Marlin cache loading phase, memory climbs to 119/119GB and the system becomes unresponsive, eventually triggering OOM kills or a full system hang requiring a hard reboot.

On driver 580, the same loading process completed with roughly 5-7GB of headroom. Driver 590 appears to consume more memory on GB10 unified memory, eliminating that margin.

What I’ve tried:

  • Adding --cpu-offload-gb 30-50 to move weights to swappable CPU memory

  • Adding 280-513GB of SSD swap per node with vm.swappiness=90

  • Setting --max-total-tokens 4096 to minimize KV cache

  • Disabling cuda graph, custom all-reduce

  • Setting PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.5,expandable_segments:True,max_split_size_mb:512

With these workarounds, one node (50GB offload) completed loading with 1.06GB free, but the other node locked up from swap thrashing (90% iowait) and became unresponsive after ~53 minutes.

Steps to reproduce:

1. GB10 with driver 590.48.01, kernel 6.14.0-1015-nvidia

2. Launch sglang with a model that uses ~110GB of Marlin-repacked weights per node (out of 119GB available)

3. Model loading OOMs or system hangs — same config works on driver 580.126.09

Questions:

1. Did the driver 590 memory management (CDMM or otherwise) change how unified memory is allocated on GB10?

2. Is there a known increase in driver memory overhead on GB10 between 580 and 590?

3. Is there a recommended driver version for GB10 inference workloads?
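Not an answer to the driver questions, but one way to quantify the 580-vs-590 overhead difference before launching sglang: on unified-memory boards like GB10, GPU allocations come out of the same pool the kernel tracks, so `MemAvailable` from `/proc/meminfo` is a reasonable proxy for loading headroom. A small sketch of my own (not an sglang utility):

```python
# On unified-memory systems (GB10, Jetson), GPU allocations draw from the
# same pool /proc/meminfo tracks, so MemAvailable approximates how much the
# model loader can claim before the OOM killer steps in.
def mem_available_gb(meminfo_path: str = "/proc/meminfo") -> float:
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / 1024**2  # value is in kB
    raise RuntimeError("MemAvailable not found in " + meminfo_path)
```

Comparing this value right after boot on each driver, before any model is loaded, would show the driver's baseline footprint directly.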

In my test, 2 DGX Sparks at 128K context reach 22 tokens/s.

Don’t use that 590 driver; we all know it has a memory leak problem.

Does anyone have a good recipe that’s stable on dual Sparks for MiniMax with full context? I’ve been using Eugr’s recipe (which is fantastic and very stable), but it only has 120K context and I’d really like to get to that 200K mark (196K, technically, I believe is the max context window). I’m not even sure how much more I can squeeze out of the dual Sparks, as they sit at ~100 GB of consumed memory apiece with just the 120K context window from the recipe.

```
./launch-cluster.sh -t vllm-node -d start

docker exec vllm_node bash -i -c "vllm serve /models/cyankiwi-MiniMax-M2.5-AWQ-4bit \
    --trust-remote-code \
    --port 8000 \
    --host 0.0.0.0 \
    --gpu-memory-utilization 0.7 \
    -tp 2 \
    --distributed-executor-backend ray \
    --served-model-name minimax-m2.5-awq \
    --max-model-len 128000 \
    --load-format fastsafetensors \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2 \
    > /proc/1/fd/1 2> /proc/1/fd/2"
```

I guess I can increase `gpu-memory-utilization`, but is it safe/stable to have it at 0.85 or 0.9? Asking because I keep vLLM running all the time and never bring the container down unless I want to switch models.
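For sizing the jump from 120K to 196K, the KV-cache cost per token can be estimated from the model config. All the architecture numbers below are placeholders to show the arithmetic, not MiniMax M2.5’s actual config; substitute the real values from the model’s config.json:

```python
# KV cache bytes/token = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes.
# The architecture numbers used below are placeholders, not the real config.
def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                head_dim: int, dtype_bytes: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens / 1024**3

# Hypothetical 62-layer, 8-KV-head, 128-dim model at fp16:
extra = kv_cache_gb(196_000, 62, 8, 128) - kv_cache_gb(120_000, 62, 8, 128)
print(round(extra, 1))  # ~18 GB more for the wider window, split across TP=2
```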

In my experience with a headless single unit, I can go rock solid to 0.93 at least.