Working recipe: MiniMax-M3 NVFP4 at TP=3 on 3x DGX Spark (no 4th node) + the OOM fixes

tonyd615 · June 15, 2026, 1:54pm

Sharing a working setup in case it helps anyone else fighting this. We got MiniMax-M3 NVFP4 (lukealonso/MiniMax-M3-NVFP4, ~243GB) serving at real tensor-parallel 3 across 3 DGX Sparks (GB10, sm_121), with clean tool-calling and reasoning, no leaked control tokens. No 4th node.

Full recipe, launcher, and verify scripts:

The build is Luke Alonso’s vLLM fork (the chthonic build) plus b12x, and his fb63c9a “Support MiniMax M3 TP3 virtual sharding” commit is what makes the 64 attention / 4 KV heads divisible by 3 (auto at --tensor-parallel-size 3). Full credit to Luke.
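To make the divisibility problem concrete, here is a minimal sketch of the arithmetic only. The head counts (64 attention, 4 KV) are from the post; the padding helper is a generic illustration of one way to reach a multiple of 3 and is an assumption, not a description of what commit fb63c9a actually does.

```python
# Head counts quoted in the post for MiniMax-M3.
ATTN_HEADS = 64
KV_HEADS = 4


def shards_evenly(num_heads: int, tp_size: int) -> bool:
    """True when every tensor-parallel rank would get the same number of heads."""
    return num_heads % tp_size == 0


def pad_to_multiple(num_heads: int, tp_size: int) -> int:
    """Smallest head count >= num_heads that divides evenly by tp_size.

    Hypothetical illustration: padding is one generic way to make a head
    count shardable; the fork's virtual sharding may work differently.
    """
    return -(-num_heads // tp_size) * tp_size


if __name__ == "__main__":
    for tp in (2, 3, 4):
        print(
            f"TP={tp}: attention ok={shards_evenly(ATTN_HEADS, tp)}, "
            f"kv ok={shards_evenly(KV_HEADS, tp)}, "
            f"padded attention={pad_to_multiple(ATTN_HEADS, tp)}, "
            f"padded kv={pad_to_multiple(KV_HEADS, tp)}"
        )
```

At TP=3 both counts leave a remainder of 1, which is why stock tensor parallelism rejects the configuration without some form of remapping.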

The parts that aren’t documented anywhere and cost us the most time were the head-node OOM fixes:

1. --load-format safetensors. instanttensor’s GDS open() throws under torch 2.12 on Spark (no GPUDirect Storage).

2. --object-store-memory 1073741824 on every ray start. Ray reserves ~30 percent of RAM (~36GB/node) for a plasma object store that vLLM TP never uses (tensors go over NCCL). On the head that reserve plus the 84GB shard plus KV overcommits the 121GB box and you hit NVRM: Out of memory during weight load. Capping it freed ~35GB/node.

3. RAY_memory_monitor_refresh_ms=0. After a fully successful warmup the head sits at ~96 percent RAM, which is normal on unified memory. Ray’s 95 percent memory monitor then false-kills the rank-0 worker (NODE_OUT_OF_MEMORY) even though there is no real OOM (no NVRM, no Linux kill, ~4.4GB free). Disable the monitor; the kernel and driver stay the real backstop.
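The head-node budget behind fixes 2 and 3 can be sketched as follows. The 121GB, 84GB, and ~30 percent figures are the ones quoted above; the helper names and the `HEAD_IP:6379` placeholder address are assumptions for illustration, and the sketch assumes the monitor variable is set in the environment of each `ray start`.

```python
import shlex

NODE_RAM_GB = 121.0                   # usable RAM per Spark, as quoted above
SHARD_GB = 84.0                       # per-node weight shard, as quoted above
RAY_DEFAULT_FRACTION = 0.30           # Ray's default object-store share of RAM
OBJECT_STORE_CAP_BYTES = 1073741824   # the cap from fix 2 (exactly 1 GiB)


def headroom_gb(object_store_gb: float) -> float:
    """RAM left for KV cache and everything else after shard + object store."""
    return NODE_RAM_GB - SHARD_GB - object_store_gb


def ray_start_command(head: bool, address: str = "HEAD_IP:6379") -> str:
    """Build a `ray start` line with the capped object store.

    `address` is a placeholder; substitute the real head address.
    """
    cmd = ["ray", "start"]
    cmd.append("--head" if head else f"--address={address}")
    cmd.append(f"--object-store-memory={OBJECT_STORE_CAP_BYTES}")
    # Fix 3: a refresh interval of 0 disables Ray's memory monitor.
    return "RAY_memory_monitor_refresh_ms=0 " + shlex.join(cmd)


default_store_gb = RAY_DEFAULT_FRACTION * NODE_RAM_GB
capped_store_gb = OBJECT_STORE_CAP_BYTES / 1e9
freed_gb = default_store_gb - capped_store_gb

if __name__ == "__main__":
    print(f"default object store: {default_store_gb:.1f} GB")
    print(f"headroom with default store: {headroom_gb(default_store_gb):.1f} GB")
    print(f"headroom with 1 GiB cap: {headroom_gb(capped_store_gb):.1f} GB")
    print(f"freed per node: {freed_gb:.1f} GB")
    print(ray_start_command(head=True))
    print(ray_start_command(head=False))
```

With the default reserve, the shard plus the object store already consume nearly the whole box before any KV cache is allocated, which matches the overcommit described in fix 2.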

Where it’s rough, and where we would love help:

Single-stream is only ~6 tok/s. The bottleneck is the interconnect, not compute. NCCL is running over the 1GbE management NIC, and TP=3 does ~120 cross-node all-reduces per token. The 200G ConnectX-7 ports sit unused for model traffic. We have a switchless RoCE-ring fix drafted in the repo (unset NCCL_IB_GID_INDEX, per-connection GID via NCCL_IB_ADDR_RANGE, and NCCL_NET_GDR_LEVEL=0 which is mandatory on GB10), but it is not landed yet. If you have switchless 3-node RoCE working with NCCL on Sparks, we want your config.
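A rough time budget shows why the per-all-reduce cost matters so much here. This is a back-of-the-envelope sketch using only the ~6 tok/s and ~120 all-reduces per token quoted above; the 0.1 ms latency in the usage line is a made-up input for illustration, not a measurement.

```python
TOKENS_PER_SECOND = 6.0       # observed single-stream rate, from the post
ALL_REDUCES_PER_TOKEN = 120   # cross-node all-reduces per token, from the post


def per_token_ms(tokens_per_second: float) -> float:
    """Wall-clock time spent on one decoded token, in milliseconds."""
    return 1000.0 / tokens_per_second


def max_ms_per_all_reduce(tokens_per_second: float, all_reduces: int) -> float:
    """Upper bound on the cost of one all-reduce.

    Assumes the all-reduces are fully serialized and account for the
    entire token time, which overstates their share.
    """
    return per_token_ms(tokens_per_second) / all_reduces


def tok_s_ceiling(ms_per_all_reduce: float, all_reduces: int) -> float:
    """Decode-rate ceiling if all-reduce time were the only cost."""
    return 1000.0 / (ms_per_all_reduce * all_reduces)


if __name__ == "__main__":
    print(f"{per_token_ms(TOKENS_PER_SECOND):.1f} ms per token")
    bound = max_ms_per_all_reduce(TOKENS_PER_SECOND, ALL_REDUCES_PER_TOKEN)
    print(f"at most {bound:.2f} ms per all-reduce")
    # Hypothetical 0.1 ms per all-reduce, purely to show the scaling.
    print(f"{tok_s_ceiling(0.1, ALL_REDUCES_PER_TOKEN):.1f} tok/s ceiling")
```

Because the count of all-reduces per token is fixed, single-stream speed scales with the per-operation cost of each small all-reduce, so the sketch is useful mainly for comparing interconnect options against the same budget.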

EAGLE3 spec-decode: the chthonic M3 class implements SupportsEagle3 and Inferact/MiniMax-M3-EAGLE3 loads, but the bf16 draft against the NVFP4 target dead-ends in vLLM’s draft-quant path. If anyone has run an eagle3 draft against an NVFP4 target, or has a quantized M3 eagle3 draft, please chime in.

The whole point of publishing this is to let people tinker and fix what we got wrong. PRs and corrections welcome.

mashie · June 15, 2026, 2:20pm

You have a good writeup about NCCL on 3 nodes here:

github.com/eugr/spark-vllm-docker, docs/NETWORKING.md (main branch):

# DGX Spark Networking

The following guide starts with a two-node cluster, but it is also applicable to larger clusters.

See [this post](https://forums.developer.nvidia.com/t/6x-spark-setup/354399/56) for an example of 6-8 node Spark cluster.
Please keep in mind that tensor-parallel vLLM deployments usually work best with a number of nodes that corresponds to a power of 2, such as 2, 4, or 8 nodes. A 3-node mesh is mainly useful for pipeline parallelism or data parallelism.

The guide assumes that the nodes are named `spark` and `spark2`, but you can use any names.
Same with IP addresses: we use `192.168.177.0/24` subnet with `.11` and `.12` assigned to both nodes, but you can use any IP addresses, as long as they are in the same subnet.

## DGX Spark ConnectX quirks

DGX Spark has a pretty unique ConnectX setup.

To achieve 200G transfer speed, ConnectX NIC needs ~x8 PCIe 5.0 lanes.

However, DGX Spark SOC can't provide more than x4 PCIe lanes per device due to hardware limitations.
So to achieve 200G on a single cable connection, each physical port shares the same pair of PCIe5 x4 connections.
Each PCIe 5 x4 link is represented by two Ethernet and two RoCE interfaces:

This file has been truncated.

Hopefully that is the last piece of the puzzle.

eugr_nv · June 15, 2026, 3:02pm

You can launch without Ray too. Spark-vllm-docker and Sparkrun can run any vLLM container on multi-node cluster with or without Ray, so with Spark-vLLM-docker you can just point to your image tag and use --no-ray flag.

And as a previous poster said, you need a special NCCL build and properly configured cluster - the doc referenced above will help.

I’m a bit behind on PRs, but would appreciate a contribution to our community build, once tp3 is working. There are quite a few of us with 3 node clusters.

tonyd615 · June 15, 2026, 3:27pm

I think I got EAGLE to work. I'm going to follow up shortly, and I'll also send this to my agent. I'm having issues with RoCE, though; I don't know if you can help me here.

tonyd615 · June 15, 2026, 4:03pm

On the direct-cable (no switch) 3-node Spark mesh: my raw ib_write_bw between two nodes caps at ~12.8 Gb/s and does NOT scale with queue pairs (q=4 and q=16 both land at 12.8), even though PCIe is Gen5 x4 full width, ethtool shows 200G, RoCEv2 GID index 3, active_mtu 4096. What unlocks your 111 Gb/s — PFC/DCB lossless config, an mlxconfig firmware setting, ECN/DCQCN, a specific cable/transceiver, or does it just work out of the box on yours? Anything DGX-Spark-specific on the CX7 ports I’d be missing?

mashie · June 15, 2026, 4:13pm

It is possible you just need to power off the nodes, unplug the power bricks for a minute, then plug everything back in and power up. You are not the first, nor the last, to experience this behaviour.

tonyd615 · June 15, 2026, 4:15pm

You're saying that's how to get the 200G? I will try that next.

mashie · June 15, 2026, 4:35pm

Yep, that is exactly what I got the first time my 2 nodes were connected.

0rand · June 15, 2026, 4:50pm

Just a word of caution: ConnectX-7 is capped by PCIe at 200G total across both ports. If both are operating, expect 100G per link, 200G total.

tonyd615 · June 15, 2026, 5:40pm

3x DGX Spark TP=3 update: vLLM/NCCL is staged for RoCE, but we’re chasing two things: err-110 from switchless mesh HCA pairing, and a raw RoCE cap around 12.8 Gb/s. Testing patched NCCL 2.30.7 launcher now; power-drain next if bandwidth stays stuck.

tonyd615 · June 15, 2026, 5:42pm

Update: err-110 is dead. Root cause was vLLM loading an old baked NCCL/LD_PRELOAD shim (2.30.4) instead of our 2.30.7 build. Patched launcher now shows FORCED_NCCL_VERSION 23007, NCCL rings connect over RoCE cleanly on 3x DGX Spark. Bench next; bandwidth cap remains.
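A quick way to confirm which NCCL a process really loaded, independent of what is installed on disk, is to decode the version code and look at the mapped libraries. This is a minimal sketch: the function names are hypothetical, the decoding assumes the NCCL 2.9+ encoding (major*10000 + minor*100 + patch), and the `/proc/self/maps` check is Linux-only and only sees libnccl after the process has loaded it.

```python
from pathlib import Path


def decode_nccl_version(code: int) -> tuple[int, int, int]:
    """Split an NCCL version code such as 23007 into (major, minor, patch).

    Assumes the encoding used by NCCL 2.9 and later.
    """
    major, rest = divmod(code, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def loaded_nccl_libs(maps_text: str) -> list[str]:
    """Return unique libnccl paths found in /proc/<pid>/maps-style text."""
    paths: list[str] = []
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode pathname
        fields = line.split(None, 5)
        if len(fields) == 6 and "libnccl" in fields[5]:
            path = fields[5].strip()
            if path not in paths:
                paths.append(path)
    return paths


def current_process_nccl_libs() -> list[str]:
    """libnccl paths mapped into this process, or [] when /proc is unavailable."""
    maps = Path("/proc/self/maps")
    if not maps.exists():
        return []
    return loaded_nccl_libs(maps.read_text())


if __name__ == "__main__":
    print(decode_nccl_version(23007))
    print(current_process_nccl_libs())
```

Run from inside the worker process after NCCL has been initialized, the mapped path shows whether a preloaded shim or the intended build won.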

mashie · June 15, 2026, 5:44pm

I think there is a line in the 3-node config to prevent the 10G links from being used by NCCL, in case the traffic still is going the wrong way.

tonyd615 · June 15, 2026, 5:49pm

Yeah, that was our concern too. We’re using enP7s7 only for NCCL socket/bootstrap and pinning the RoCE HCAs for data via NCCL_IB_HCA. Latest patched run shows NCCL 2.30.7, rings connected, no err-110; checking logs now to confirm NET/IB vs NET/Socket and then benching.

tonyd615 · June 15, 2026, 6:02pm

Mashie, everything is working; the only issue is the 12 Gb/s cap. I am about to do your last test, the power-down, to see if that resets it.

tonyd615 · June 15, 2026, 6:25pm

Restart worked! 100 Gb/s now; benching.

eugr_nv · June 15, 2026, 7:04pm

Just follow the documentation here: docs/NETWORKING.md in eugr/spark-vllm-docker on GitHub.

The important bits:

1. You need to build NCCL 2.30u1 from source for the 3-node config to work properly.

2. You need to use the 10G adapter IPs for OOB communication.

3. You need to connect the cables the way it's described in the doc; otherwise you won't get full speed.

4. You need to set the environment variables properly (as in the doc), specifically subnet-aware routing and not merging NICs.

tonyd615 · June 15, 2026, 7:41pm

Thanks eugr, your NETWORKING.md got us there. We did all four: built NCCL 2.30u1 from source for sm_121, OOB on the 10G mgmt NIC, the 3-node switchless mesh with per-leg /30 subnets and RoCEv2 GID index 3, and subnet-aware-routing with MERGE_NICS=0.
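For anyone laying out the same topology, a per-leg /30 plan for a switchless 3-node mesh can be generated like this. It is a sketch under assumptions: the `192.168.177.0/24` range and the `spark`/`spark2` names are borrowed from the NETWORKING.md excerpt above, `spark3` and the function name are hypothetical, and the actual addressing in the repo may differ.

```python
import ipaddress
from itertools import combinations


def mesh_legs(nodes: list[str], supernet: str) -> list[tuple[str, str, str, str, str]]:
    """Assign one /30 subnet to each direct cable in a full mesh.

    Returns (node_a, ip_a, node_b, ip_b, subnet) per leg. A /30 has exactly
    two usable host addresses, one for each end of the cable.
    """
    pairs = combinations(nodes, 2)
    subnets = ipaddress.ip_network(supernet).subnets(new_prefix=30)
    plan = []
    for (node_a, node_b), net in zip(pairs, subnets):
        ip_a, ip_b = net.hosts()
        plan.append((node_a, str(ip_a), node_b, str(ip_b), str(net)))
    return plan


if __name__ == "__main__":
    # Node names and address range are illustrative placeholders.
    for leg in mesh_legs(["spark", "spark2", "spark3"], "192.168.177.0/24"):
        print("{0} {1} <-> {2} {3}  ({4})".format(*leg))
```

Keeping each cable in its own subnet is what lets subnet-aware routing pick the right interface for each peer.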

Two extra things we hit that might help others:

1. The vLLM container had a baked LD_PRELOAD pointing at an older nccl “local-inference” 2.30.4 shim. A baked LD_PRELOAD beats both a symlink swap and an LD_LIBRARY_PATH prepend, so it silently kept loading 2.30.4 (the banner read 2.30.4 even with 2.30u1 installed), and that shim lacked the working subnet-aware override, which kept throwing ibv_modify_qp err 110 on the switchless mesh. Forcing LD_PRELOAD to the 2.30u1 lib and unsetting the shim env vars fixed it.

2. Raw ib_write_bw was stuck at ~12.8 Gb/s and would not scale with queue pairs, on healthy Gen5 x4 / 200G hardware. A full cold power-drain (power off, unplug the bricks ~90s, power back on) cleared it to 111.85 Gb/s, exactly your number. A warm reboot did not do it.
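To put the two bandwidth numbers in context, here is the raw PCIe arithmetic for the Gen5 x4 link mentioned in this thread. The 32 GT/s per lane and 128b/130b encoding are standard PCIe 5.0 figures; the result is a raw link rate before packet and protocol overhead, so achievable RDMA throughput sits somewhat below it. The function name is hypothetical.

```python
GEN5_GT_PER_LANE = 32.0      # PCIe 5.0 signalling rate per lane, GT/s
ENCODING = 128.0 / 130.0     # 128b/130b line encoding efficiency


def pcie_gen5_raw_gbps(lanes: int) -> float:
    """Raw one-direction PCIe 5.0 link rate in Gb/s, before protocol overhead."""
    return GEN5_GT_PER_LANE * lanes * ENCODING


if __name__ == "__main__":
    x4 = pcie_gen5_raw_gbps(4)
    print(f"Gen5 x4 raw: {x4:.2f} Gb/s")
    print(f"Two x4 links raw: {pcie_gen5_raw_gbps(8):.2f} Gb/s")
    # Measured values quoted in this thread.
    for measured in (111.85, 12.8):
        print(f"{measured} Gb/s is {measured / x4:.1%} of one x4 link")
```

On these figures the post-power-drain result is close to what a single x4 link can carry, while the stuck value was roughly a tenth of it.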

TP=3 MiniMax-M3 now serving clean over RoCE at 200K context. Full writeup: tonyd2wild/minimax-m3-dgx-spark-tp3 on GitHub.

miken · June 15, 2026, 7:47pm

Great work, big step forwards. What do you think the tok/s would be if you used a switch for the 3 nodes to get the full 200 Gb/s network speed?

tonyd615 · June 15, 2026, 7:49pm

From what I am seeing, you don’t get a speed boost, you get more concurrency. I don’t know how general that is, but I went from 12 Gb/s to 100 Gb/s and single-stream speed did not move, while concurrency did.

corbett_korbett · June 15, 2026, 8:13pm

Would bullerwins/MiniMax-M3-4bit-W4A16-v0 (on Hugging Face) work on a 2x DGX Spark cluster? It’s less than 227GB on disk.
