Sharing a working setup in case it helps anyone else fighting this. We got MiniMax-M3 NVFP4 (lukealonso/MiniMax-M3-NVFP4, ~243GB) serving at real tensor-parallel 3 across 3 DGX Sparks (GB10, sm_121), with clean tool-calling and reasoning, no leaked control tokens. No 4th node.
Full recipe, launcher, and verify scripts:
The build is Luke Alonso’s vLLM fork (the chthonic build) plus b12x. His fb63c9a “Support MiniMax M3 TP3 virtual sharding” commit is what lets the 64 attention / 4 KV heads split across 3 ranks even though neither count is divisible by 3 (it applies automatically at --tensor-parallel-size 3). Full credit to Luke.
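For anyone wondering why TP=3 needs a special commit at all, here is a tiny sketch of the divisibility problem. The head counts are the ones quoted above; the helper name is made up for illustration and is not code from vLLM or from Luke's fork, and it says nothing about how the virtual sharding itself works.

```python
# Illustration only: plain tensor parallelism wants every rank to own the same
# whole number of heads. even_split() is a hypothetical helper, not vLLM code.

ATTN_HEADS = 64  # attention heads, from the post
KV_HEADS = 4     # KV heads, from the post


def even_split(num_heads: int, tp_size: int) -> bool:
    """True when num_heads divides evenly across tp_size ranks."""
    return num_heads % tp_size == 0


for tp in (2, 3, 4):
    print(
        f"TP={tp}: attention even={even_split(ATTN_HEADS, tp)} "
        f"(remainder {ATTN_HEADS % tp}), "
        f"KV even={even_split(KV_HEADS, tp)} (remainder {KV_HEADS % tp})"
    )
```

TP=2 and TP=4 split cleanly, while TP=3 leaves one attention head and one KV head over, which is the gap the fb63c9a commit covers.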
The parts that aren’t documented anywhere and cost us the most time were the head-node OOM fixes:
1. --load-format safetensors. instanttensor’s GDS open() throws under torch 2.12 on Spark (no GPUDirect Storage).
2. --object-store-memory 1073741824 on every ray start. Ray reserves ~30 percent of RAM (~36GB/node) for a plasma object store that vLLM TP never uses (tensors go over NCCL). On the head that reserve plus the 84GB shard plus KV overcommits the 121GB box and you hit NVRM: Out of memory during weight load. Capping it freed ~35GB/node.
3. RAY_memory_monitor_refresh_ms=0. After a fully successful warmup the head sits at ~96 percent RAM, which is normal on unified memory. Ray’s 95 percent memory monitor then false-kills the rank-0 worker (NODE_OUT_OF_MEMORY) even though there is no real OOM (no NVRM, no Linux kill, ~4.4GB free). Disable the monitor; the kernel and driver stay the real backstop.
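To make the head-node arithmetic in fixes 2 and 3 concrete, here is a rough back-of-envelope sketch. It only uses the numbers quoted above (121GB box, 84GB shard, ~30 percent object store reserve, 1073741824-byte cap, 95 percent monitor threshold). It assumes decimal GB throughout and ignores the OS, CUDA context, and activations, so treat the outputs as orders of magnitude, not measurements. The function names are made up for the example.

```python
# Back-of-envelope head-node memory budget, using only figures from the post.
# Assumption: decimal GB everywhere; OS and runtime overheads are ignored.

GB = 1e9
TOTAL_RAM_GB = 121.0              # unified memory per Spark
SHARD_GB = 84.0                   # weight shard on the head node
DEFAULT_STORE_FRACTION = 0.30     # approximate default plasma reserve
CAPPED_STORE_BYTES = 1073741824   # value passed to --object-store-memory


def headroom_gb(object_store_gb: float) -> float:
    """RAM left for KV cache and everything else after weights and the object store."""
    return TOTAL_RAM_GB - SHARD_GB - object_store_gb


def monitor_would_kill(used_fraction: float, threshold: float = 0.95) -> bool:
    """Simplified model of a usage-threshold memory monitor."""
    return used_fraction > threshold


default_store_gb = DEFAULT_STORE_FRACTION * TOTAL_RAM_GB
capped_store_gb = CAPPED_STORE_BYTES / GB

print(f"default reserve: {default_store_gb:.1f} GB, headroom {headroom_gb(default_store_gb):.1f} GB")
print(f"capped reserve:  {capped_store_gb:.1f} GB, headroom {headroom_gb(capped_store_gb):.1f} GB")
print(f"freed by the cap: {default_store_gb - capped_store_gb:.1f} GB")
print(f"monitor trips at 96 percent used: {monitor_would_kill(0.96)}")
```

With the default reserve there is under 1GB left before any KV cache is allocated, which lines up with the OOM during weight load. With the cap, roughly 36GB comes back. The last line shows why a healthy 96 percent steady state still trips a 95 percent threshold.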
Where it’s rough, and where we would love help:
Single-stream is only ~6 tok/s. The bottleneck is the interconnect, not compute. NCCL is running over the 1GbE management NIC, and TP=3 does ~120 cross-node all-reduces per token. The 200G ConnectX-7 ports sit unused for model traffic. We have a switchless RoCE-ring fix drafted in the repo (unset NCCL_IB_GID_INDEX, per-connection GID via NCCL_IB_ADDR_RANGE, and NCCL_NET_GDR_LEVEL=0, which is mandatory on GB10), but it has not landed yet. If you have switchless 3-node RoCE working with NCCL on Sparks, we want your config.
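A quick sanity check on why the interconnect dominates, using only the two numbers above (~6 tok/s, ~120 all-reduces per token). This is a sketch, not a profile: it treats the whole per-token time as if it were communication, so the per-all-reduce figure is an upper bound. The 0.1 ms value at the end is purely hypothetical, picked to show the shape of the curve, not a measured RoCE latency.

```python
# Per-token latency budget implied by the observed throughput.
# Assumption: all-reduces are serialized; compute time is folded into other_ms.

TOKENS_PER_SECOND = 6.0      # observed single-stream rate, from the post
ALLREDUCES_PER_TOKEN = 120   # approximate count, from the post

ms_per_token = 1000.0 / TOKENS_PER_SECOND                    # about 166.7 ms
max_ms_per_allreduce = ms_per_token / ALLREDUCES_PER_TOKEN   # about 1.39 ms


def tokens_per_second(ms_per_allreduce: float, other_ms_per_token: float = 0.0) -> float:
    """Throughput if each all-reduce costs ms_per_allreduce and the rest costs other_ms."""
    return 1000.0 / (ALLREDUCES_PER_TOKEN * ms_per_allreduce + other_ms_per_token)


print(f"time per token: {ms_per_token:.1f} ms")
print(f"upper bound per all-reduce: {max_ms_per_allreduce:.2f} ms")

# Hypothetical faster link, for illustration only (not a measurement).
print(f"ceiling at 0.1 ms per all-reduce: {tokens_per_second(0.1):.1f} tok/s")
```

Even if every millisecond of the ~167 ms per token were spent in all-reduces, each one would be taking at most ~1.4 ms, so per-message latency on the link is the number to attack, not bandwidth alone.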
EAGLE3 spec-decode: the chthonic M3 class implements SupportsEagle3 and Inferact/MiniMax-M3-EAGLE3 loads, but the bf16 draft against the NVFP4 target dead-ends in vLLM’s draft-quant path. If anyone has run an eagle3 draft against an NVFP4 target, or has a quantized M3 eagle3 draft, please chime in.
The whole point of publishing this is to let people tinker and fix what we got wrong. PRs and corrections welcome.