Working recipe: MiniMax-M3 NVFP4 at TP=3 on 3x DGX Spark (no 4th node) + the OOM fixes

Sharing a working setup in case it helps anyone else fighting this. We got MiniMax-M3 NVFP4 (lukealonso/MiniMax-M3-NVFP4, ~243GB) serving at real tensor-parallel 3 across 3 DGX Sparks (GB10, sm_121), with clean tool-calling and reasoning, no leaked control tokens. No 4th node.

Full recipe, launcher, and verify scripts:

The build is Luke Alonso’s vLLM fork (the chthonic build) plus b12x, and his fb63c9a “Support MiniMax M3 TP3 virtual sharding” commit is what makes the 64 attention / 4 KV heads divisible by 3 (auto at --tensor-parallel-size 3). Full credit to Luke.

The parts that aren’t documented anywhere and cost us the most time were the head-node OOM fixes:

1. --load-format safetensors. instanttensor’s GDS open() throws under torch 2.12 on Spark (no GPUDirect Storage).

2. --object-store-memory 1073741824 on every ray start. Ray reserves ~30 percent of RAM (~36GB/node) for a plasma object store that vLLM TP never uses (tensors go over NCCL). On the head that reserve plus the 84GB shard plus KV overcommits the 121GB box and you hit NVRM: Out of memory during weight load. Capping it freed ~35GB/node.

3. RAY_memory_monitor_refresh_ms=0. After a fully successful warmup the head sits at ~96 percent RAM, which is normal on unified memory. Ray’s 95 percent memory monitor then false-kills the rank-0 worker (NODE_OUT_OF_MEMORY) even though there is no real OOM (no NVRM, no Linux kill, ~4.4GB free). Disable the monitor; the kernel and driver stay the real backstop.
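For anyone who wants to sanity-check the memory arithmetic behind fixes 2 and 3, here is a rough Python sketch. The 121GB, 84GB, and 30 percent figures are the ones quoted above; the helper names are made up for illustration, and the `ray start` argv is assembled only from the flag and environment variable named above, so treat it as a sketch rather than a copy of the launcher in the repo.

```python
# Rough head-node memory budget, using the numbers quoted in the post.
GIB = 1024 ** 3

NODE_RAM_GB = 121                  # usable unified memory per Spark (from the post)
SHARD_GB = 84                      # weight shard per rank (from the post)
RAY_DEFAULT_FRACTION = 0.30        # Ray's default object-store share of RAM
OBJECT_STORE_CAP_BYTES = 1 * GIB   # 1073741824, the cap from fix 2


def object_store_reserve_gb(node_ram_gb, cap_bytes=None):
    """Object-store reservation in GB, with or without an explicit cap."""
    if cap_bytes is None:
        return node_ram_gb * RAY_DEFAULT_FRACTION
    return cap_bytes / 1e9


def ray_start_command(head, address=None):
    """Hypothetical helper: build a `ray start` argv plus env for fixes 2 and 3."""
    argv = ["ray", "start", f"--object-store-memory={OBJECT_STORE_CAP_BYTES}"]
    argv.append("--head" if head else f"--address={address}")
    env = {"RAY_memory_monitor_refresh_ms": "0"}  # 0 disables Ray's memory monitor
    return argv, env


default_reserve = object_store_reserve_gb(NODE_RAM_GB)
capped_reserve = object_store_reserve_gb(NODE_RAM_GB, OBJECT_STORE_CAP_BYTES)
freed = default_reserve - capped_reserve
left_by_default = NODE_RAM_GB - SHARD_GB - default_reserve

print(f"default object-store reserve: {default_reserve:.1f} GB")
print(f"left for KV and the OS with the default reserve: {left_by_default:.1f} GB")
print(f"freed by capping the store at 1 GiB: {freed:.1f} GB")
print(ray_start_command(head=True))
```

With the quoted numbers the default reserve plus the shard leaves under 1GB for KV cache and everything else, which is consistent with the overcommit described in fix 2.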

Where it’s rough, and where we would love help:

Single-stream is only ~6 tok/s. The bottleneck is the interconnect, not compute. NCCL is running over the 1GbE management NIC, and TP=3 does ~120 cross-node all-reduces per token. The 200G ConnectX-7 ports sit unused for model traffic. We have a switchless RoCE-ring fix drafted in the repo (unset NCCL_IB_GID_INDEX, per-connection GID via NCCL_IB_ADDR_RANGE, and NCCL_NET_GDR_LEVEL=0 which is mandatory on GB10), but it is not landed yet. If you have switchless 3-node RoCE working with NCCL on Sparks, we want your config.
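To put numbers on the interconnect claim, a back-of-envelope sketch in Python follows. The 6 tok/s and 120 all-reduces per token come from the paragraph above; the serial model (token time = compute + N x all-reduce latency) is a deliberate simplification, and the 30 ms compute and 0.1 ms latency in the second call are hypothetical placeholders, not measurements.

```python
# Back-of-envelope: how much time each all-reduce can take at the observed rate.
OBSERVED_TOKENS_PER_SECOND = 6.0   # single-stream rate (from the post)
ALLREDUCES_PER_TOKEN = 120         # cross-node all-reduces per token (from the post)


def per_allreduce_budget_ms(tokens_per_second, allreduces_per_token):
    """Upper bound on mean all-reduce time if communication used the whole token time."""
    token_time_ms = 1000.0 / tokens_per_second
    return token_time_ms / allreduces_per_token


def modeled_tokens_per_second(compute_ms, allreduce_ms, allreduces_per_token):
    """Crude serial model: token time = compute + N * all-reduce latency."""
    return 1000.0 / (compute_ms + allreduces_per_token * allreduce_ms)


budget = per_allreduce_budget_ms(OBSERVED_TOKENS_PER_SECOND, ALLREDUCES_PER_TOKEN)
print(f"at most {budget:.2f} ms per all-reduce at 6 tok/s")

# Hypothetical inputs, only to show how the model responds to lower latency.
print(f"{modeled_tokens_per_second(30.0, 0.1, ALLREDUCES_PER_TOKEN):.1f} tok/s")
```

The point of the first number is that single-stream decode is sensitive to per-all-reduce latency, not only to link bandwidth, because the all-reduces happen one after another for every token.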

EAGLE3 spec-decode: the chthonic M3 class implements SupportsEagle3 and Inferact/MiniMax-M3-EAGLE3 loads, but the bf16 draft against the NVFP4 target dead-ends in vLLM’s draft-quant path. If anyone has run an eagle3 draft against an NVFP4 target, or has a quantized M3 eagle3 draft, please chime in.

The whole point of publishing this is to let people tinker and fix what we got wrong. PRs and corrections welcome.

You have a good writeup about NCCL on 3 nodes here:

Hopefully that is the last piece of the puzzle.

You can launch without Ray too. Spark-vllm-docker and Sparkrun can run any vLLM container on a multi-node cluster with or without Ray, so with Spark-vllm-docker you can just point to your image tag and use the --no-ray flag.

And as a previous poster said, you need a special NCCL build and a properly configured cluster. The doc referenced above will help.

I’m a bit behind on PRs, but would appreciate a contribution to our community build, once tp3 is working. There are quite a few of us with 3 node clusters.

I think I got Eagle to work. I'm going to follow up shortly, and I'll also send this to my agent. I'm having issues with RoCE, though; I don't know if you can help me here.

On the direct-cable (no switch) 3-node Spark mesh: my raw ib_write_bw between two nodes caps at ~12.8 Gb/s and does NOT scale with queue pairs (q=4 and q=16 both land at 12.8), even though PCIe is Gen5 x4 full width, ethtool shows 200G, RoCEv2 GID index 3, active_mtu 4096. What unlocks your 111 Gb/s — PFC/DCB lossless config, an mlxconfig firmware setting, ECN/DCQCN, a specific cable/transceiver, or does it just work out of the box on yours? Anything DGX-Spark-specific on the CX7 ports I’d be missing?

It is possible you just need to power off the nodes, unplug the power bricks for a minute, then plug it all back in and power up. You are not the first nor the last to experience this behaviour.

You're saying that's how to get the 200G? I will try that next.

Yep, exactly what I got first time my 2 nodes were connected.

Just a word of caution: ConnectX-7 is capped by PCIe at 200G total across both ports. If both are operating, expect 100G per link, 200G total.

3x DGX Spark TP=3 update: vLLM/NCCL is staged for RoCE, but we’re chasing two things: err-110 from switchless mesh HCA pairing, and a raw RoCE cap around 12.8 Gb/s. Testing patched NCCL 2.30.7 launcher now; power-drain next if bandwidth stays stuck.

Update: err-110 is dead. Root cause was vLLM loading an old baked NCCL/LD_PRELOAD shim (2.30.4) instead of our 2.30.7 build. Patched launcher now shows FORCED_NCCL_VERSION 23007, NCCL rings connect over RoCE cleanly on 3x DGX Spark. Bench next; bandwidth cap remains.
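A quick way to double-check which NCCL a process is about to pick up is sketched below in Python. It assumes the 23007 printed by the launcher is NCCL's standard integer version code (major*10000 + minor*100 + patch for current releases); `decode_nccl_version` and `nccl_preloads` are illustrative helper names, not part of vLLM or NCCL.

```python
import os


def decode_nccl_version(code):
    """Decode NCCL's integer version code into (major, minor, patch)."""
    if code < 10000:
        # Older scheme (up to 2.8.x): major*1000 + minor*100 + patch
        major, rest = divmod(code, 1000)
    else:
        # Current scheme: major*10000 + minor*100 + patch
        major, rest = divmod(code, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def nccl_preloads(environ=os.environ):
    """Return LD_PRELOAD entries whose file name looks like an NCCL library."""
    raw = environ.get("LD_PRELOAD", "")
    # LD_PRELOAD entries may be separated by colons or whitespace.
    entries = raw.replace(":", " ").split()
    return [e for e in entries if "nccl" in os.path.basename(e).lower()]


print(decode_nccl_version(23007))  # (2, 30, 7)
print(decode_nccl_version(23004))  # (2, 30, 4)
print(nccl_preloads())             # any NCCL-looking preload in this environment
```

Running `nccl_preloads()` inside the container before launch shows whether a baked preload is still in the environment, which is the situation described in this update.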

I think there is a line in the 3-node config to prevent the 10G links from being used by NCCL, in case the traffic still is going the wrong way.

Yeah, that was our concern too. We’re using enP7s7 only for NCCL socket/bootstrap and pinning the RoCE HCAs for data via NCCL_IB_HCA. Latest patched run shows NCCL 2.30.7, rings connected, no err-110; checking logs now to confirm NET/IB vs NET/Socket and then benching.
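The NET/IB versus NET/Socket check can be scripted instead of eyeballed. The Python sketch below assumes the log was captured with NCCL_DEBUG=INFO and that channel setup lines carry a `via NET/<transport>` tag; the exact line format can differ between NCCL versions, so adjust the pattern to what your logs show. The sample lines are synthetic, not captured output.

```python
import re

# Matches the transport tag on channel setup lines, e.g. "via NET/IB/0".
TRANSPORT_TAG = re.compile(r"via NET/(IB|Socket)\b")


def count_channel_transports(log_text):
    """Count channel lines per network transport in an NCCL debug log."""
    counts = {"IB": 0, "Socket": 0}
    for line in log_text.splitlines():
        match = TRANSPORT_TAG.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Synthetic lines for illustration only, not real NCCL output.
sample = "\n".join([
    "... Channel 00/0 : 0[0] -> 1[0] [send] via NET/IB/0",
    "... Channel 01/0 : 0[0] -> 1[0] [send] via NET/IB/1",
])
print(count_channel_transports(sample))
```

Any non-zero Socket count on the data channels would suggest model traffic is still falling back to the management NIC.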

Mashie, everything is working; the only issue is the ~12 Gb/s cap. I am about to try your last suggestion, a full power-down, to see if it resets.

Restart worked! 100 Gb/s now, benching.

Just follow the documentation here: spark-vllm-docker/docs/NETWORKING.md (eugr/spark-vllm-docker on GitHub)

The important bits:

  • You need to build NCCL 2.30u1 from source for 3 node config to work properly
  • You need to use 10G adapter IPs for OOB communication
  • You need to connect cables the way it’s described in the doc, otherwise you won’t be getting the full speed.
  • You need to set the environment variables properly (like in the doc) - subnet aware routing and do not merge NICs specifically.

Thanks eugr, your NETWORKING.md got us there. We did all four: built NCCL 2.30u1 from source for sm_121, OOB on the 10G mgmt NIC, the 3-node switchless mesh with per-leg /30 subnets and RoCEv2 GID index 3, and subnet-aware-routing with MERGE_NICS=0.
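For readers who have not laid out a switchless mesh before, the "one /30 per leg" idea can be sketched with Python's `ipaddress` module. The node names and the 192.168.100.0/28 block are hypothetical placeholders; the addressing that is known to work is the one in NETWORKING.md.

```python
import ipaddress
from itertools import combinations


def mesh_subnets(nodes, base_block):
    """Assign one /30 to each direct cable in a full mesh of `nodes`."""
    links = list(combinations(nodes, 2))   # 3 nodes -> 3 point-to-point legs
    subnets = ipaddress.ip_network(base_block).subnets(new_prefix=30)
    plan = {}
    for (a, b), net in zip(links, subnets):
        host_a, host_b = net.hosts()       # a /30 has exactly two usable hosts
        plan[(a, b)] = {"subnet": str(net), a: str(host_a), b: str(host_b)}
    if len(plan) != len(links):
        raise ValueError("base block too small for one /30 per link")
    return plan


# Hypothetical node names and address block, for illustration only.
plan = mesh_subnets(["spark1", "spark2", "spark3"], "192.168.100.0/28")
for link, info in plan.items():
    print(link, info)
```

Giving each cable its own subnet means every leg is a distinct route, which is what the subnet-aware routing setting relies on.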

Two extra things we hit that might help others:

  1. The vLLM container had a baked LD_PRELOAD pointing at an older nccl “local-inference” 2.30.4 shim. A baked LD_PRELOAD beats both a symlink swap and an LD_LIBRARY_PATH prepend, so it silently kept loading 2.30.4 (the banner read 2.30.4 even with 2.30u1 installed), and that shim lacked the working subnet-aware override, which kept throwing ibv_modify_qp err 110 on the switchless mesh. Forcing LD_PRELOAD to the 2.30u1 lib and unsetting the shim env vars fixed it.

  2. Raw ib_write_bw was stuck at ~12.8 Gb/s and would not scale with queue pairs, on healthy Gen5 x4 / 200G hardware. A full cold power-drain (power off, unplug the bricks ~90s, power back on) cleared it to 111.85 Gb/s, exactly your number. A warm reboot did not do it.
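For context on how close 111.85 Gb/s is to the link ceiling, here is a small Python sketch of the raw bandwidth of one PCIe Gen5 x4 link (the width reported earlier in the thread). It uses the standard PCIe 5.0 figures of 32 GT/s per lane and 128b/130b encoding, and it ignores TLP and protocol overhead, so the real ceiling is somewhat lower than the number it prints.

```python
PCIE_GEN5_GT_PER_LANE = 32.0      # PCIe 5.0 raw rate per lane, in GT/s
ENCODING_EFFICIENCY = 128 / 130   # 128b/130b line encoding


def pcie_link_gbps(lanes, gt_per_lane=PCIE_GEN5_GT_PER_LANE):
    """Raw one-direction link bandwidth in Gb/s, before protocol overhead."""
    return lanes * gt_per_lane * ENCODING_EFFICIENCY


ceiling = pcie_link_gbps(lanes=4)
measurements = [("before power-drain", 12.8), ("after power-drain", 111.85)]
for label, measured in measurements:
    share = measured / ceiling
    print(f"{label}: {measured} Gb/s is {share:.0%} of the {ceiling:.1f} Gb/s x4 ceiling")
```

A Gen5 x4 link tops out at roughly 126 Gb/s raw, so ~112 Gb/s is close to what one link can deliver, while 12.8 Gb/s is about a tenth of it.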

TP=3 MiniMax-M3 now serving clean over RoCE at 200K context. Full writeup: GitHub - tonyd2wild/minimax-m3-dgx-spark-tp3

Great work, big step forward. What do you think the t/s speeds would be if you used a switch for the 3 nodes to get the full 200 Gb/s network speed?

From what I am seeing, you don't get a speed boost, you get more concurrency. I don't know how true that is in general, but going from 12 Gb/s to 100 Gb/s did not move single-stream speed; concurrency is what improved.

Would bullerwins/MiniMax-M3-4bit-W4A16-v0 (on Hugging Face) work on 2x DGX Spark nodes? It's less than 227GB on disk.