Another dual-Spark data point — relax base + scheduler cherry-picks (Ray + RoCE), and the GB10 gotchas that actually mattered
Adding my setup since it differs a bit from the recipes above — Ray backend, RoCE interconnect, a different commit pin — and lands in the same ballpark. The part worth your time is the four
GB10/UMA things that actually fixed it; some advice in this thread is chasing the wrong layer.
Hardware: 2× DGX Spark GB10 (sm_121), 121 GiB UMA, driver 595.71.05, 200 Gbps RoCE between nodes. TP=2, Ray (not mp).
One thing worth clarifying: this model reports quant_method: fp8, but at runtime the MoE experts are actually FP4 → mxfp4.py MARLIN (expert_dtype resolved to ‘fp4’); only dense/attn/KV are
FP8. So the #41834 mxfp4 cleanup is on your path — but if your base already has the del + empty_cache, re-applying it is a no-op. Check before rebuilding.
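One rough way to do that check from inside your vLLM checkout is to grep the tracked mxfp4.py for the cleanup call. This is only a sketch: `mxfp4_has_cleanup` is a hypothetical helper name, the `'*mxfp4.py'` pathspec is used so no exact path is assumed, and finding `empty_cache` only suggests the cleanup is present, it does not prove the exact #41834 change is applied.

```shell
# Hypothetical helper: succeed if any tracked file named mxfp4.py in the
# current git checkout already mentions empty_cache.
mxfp4_has_cleanup() {
  git grep -q -e 'empty_cache' -- '*mxfp4.py'
}

# Usage, run from the root of the checkout:
#   if mxfp4_has_cleanup; then echo "cleanup looks present, rebuild is likely a no-op"; fi
```

If it reports nothing, look at the file by hand before deciding to cherry-pick and rebuild.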
Commit pin: jasl codex/ds4-sm120-min-enable, base edc82b614f51 (“Tune SM120 FP8 MQA logits row tile”, ~05-19) + 4 decode-protection cherry-picks:
git checkout edc82b614f51f4f9ce16c7010e879571e5c46136
git fetch origin codex/ds4-sm120-min-enable
for c in e1334312f4c67b5502ffc61438f9c559b73b5d1e \
         5dcd086fd1d58b74bd5849623a9e95dc32836a32 \
         65da3607d70e08d399960795984efd2a9d52a4dd \
         e9c364bf93347f31b4a882cec815691194531b8c; do
  git cherry-pick -x "$c"
done
Heads up: the branch rebases constantly, so SHAs rehash — match by subject if one doesn’t resolve. I tried the later HEAD (warmup-expansion + sparse-MLA split after relax) and it thrashed host
page-cache on startup at 8192×8 and locked both nodes — rolled back to relax. The 4 picks are throughput-neutral, just decode stability.
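For the match-by-subject fallback, something like the following can look a commit up by its subject line on the freshly fetched branch. It is a minimal sketch: `find_by_subject` is a hypothetical helper (not a git command), it assumes subjects contain no tab characters, and it returns the newest exact match only.

```shell
# Hypothetical helper: print the SHA of the newest commit reachable from
# ref $1 whose subject line is exactly $2. Returns 1 if there is no match.
find_by_subject() (
  ref=$1
  SUBJECT=$2
  export SUBJECT
  # One "<sha><TAB><subject>" line per commit; awk compares the subject field.
  sha=$(git log --format='%H%x09%s' "$ref" |
    awk -F'\t' '$2 == ENVIRON["SUBJECT"] { print $1; exit }')
  [ -n "$sha" ] && printf '%s\n' "$sha"
)

# Usage, after `git fetch origin codex/ds4-sm120-min-enable`:
#   find_by_subject FETCH_HEAD "Tune SM120 FP8 MQA logits row tile"
```

The subject in the usage line is the base commit named above; for the four picks you would substitute their own subjects, which you can read off the old SHAs with git log before they rehash.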
Prebuilt image:
docker pull <image reference from the vllm-spark package page on GitHub>
Runtime image only (bring your own weights + serve command). GB10 patches baked in.
The 4 things that actually mattered on GB10/UMA:
- VLLM_SKIP_INIT_MEMORY_CHECK=1 — the key one. psutil and CUDA disagree on free memory on GB10, so vLLM’s pre-profile and post-profile memory asserts both abort with plenty of headroom. This
env bypasses both; a real OOM still surfaces at weight load.
- Wipe ~/.cache/vllm when you change image/build. Cousin of the Triton stale-cache bug (#41871): on sm_121, stale compiled artifacts get silently reused → garbled output, no crash. Container
recreation resets Triton’s cache but not a host-mounted ./.cache/vllm.
- Reboot between runs. Stopping the container leaves ~100 GiB stuck in the driver — rmmod nvidia_uvm won’t free it, only a reboot does. A/B without a reboot starts the second run
memory-starved.
- Re: the OOM-at-KV-alloc reports — never needed the host sysctl tuning. Clean-UMA boot + the memory-check bypass keeps the load+profiling spike (~33 GiB) inside 121 GiB at gpu_mem=0.85. The
sysctl route treats a symptom; this addresses where it actually aborts. (Multi-day page-cache creep is real though — I just reboot.)
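The cache wipe and the "did the last run leave memory stuck" question can be folded into a small pre-flight before each launch. This is a sketch under assumptions: the helper names are made up, the GiB threshold is whatever you consider a clean boot on your box, and since psutil and CUDA disagree on free memory on GB10, MemAvailable from /proc/meminfo is only a rough signal here, not a guarantee.

```shell
# Print MemAvailable from a meminfo-style file, in whole GiB (kB / 1024^2).
mem_available_gib() {
  awk '$1 == "MemAvailable:" { print int($2 / 1048576) }' "${1:-/proc/meminfo}"
}

# Succeed only if at least $1 GiB is available. The threshold is the caller's
# guess; nothing in vLLM or the driver defines it.
enough_memory() {
  have=$(mem_available_gib "${2:-/proc/meminfo}")
  [ "${have:-0}" -ge "$1" ]
}

# Empty the vLLM compile cache but keep the directory itself, so a host
# bind mount still has something to attach to.
wipe_vllm_cache() {
  dir=${1:-$HOME/.cache/vllm}
  [ -d "$dir" ] || return 0
  find "$dir" -mindepth 1 -delete
}

# Usage (100 is an arbitrary example threshold):
#   enough_memory 100 || echo "low memory: reboot before the next run"
#   wipe_vllm_cache ./.cache/vllm
```

Run the wipe with the container stopped, and point it at the host-mounted path if that is where your cache lives.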
Serve config:
TP=2 (Ray), gpu_mem=0.85, max_model_len=200000, bt=8192, max_num_seqs=8
--kv-cache-dtype fp8 --block-size 256 --enable-expert-parallel
--speculative-config '{"method":"deepseek_mtp","num_speculative_tokens":2}' # MTP=2
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'
env: TORCH_CUDA_ARCH_LIST=12.1a, VLLM_TRITON_MLA_SPARSE=1,
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1, VLLM_SKIP_INIT_MEMORY_CHECK=1
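Spelled out as one command, the shorthand above would look roughly like this. Treat it as a sketch: `serve_dual_spark` is a hypothetical wrapper, the model path is a parameter because it isn't given here, I'm assuming bt means --max-num-batched-tokens and gpu_mem means --gpu-memory-utilization, and it assumes the Ray cluster across both nodes is already up.

```shell
# Hypothetical wrapper around the serve config above. $1 is the model path
# or ID. The body runs in a subshell so the exports do not leak out.
serve_dual_spark() (
  model=${1:?usage: serve_dual_spark MODEL}
  export TORCH_CUDA_ARCH_LIST=12.1a
  export VLLM_TRITON_MLA_SPARSE=1
  export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
  export VLLM_SKIP_INIT_MEMORY_CHECK=1
  vllm serve "$model" \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray \
    --gpu-memory-utilization 0.85 \
    --max-model-len 200000 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 8 \
    --kv-cache-dtype fp8 \
    --block-size 256 \
    --enable-expert-parallel \
    --speculative-config '{"method":"deepseek_mtp","num_speculative_tokens":2}' \
    --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'
)

# Usage: serve_dual_spark /path/to/weights
```

The JSON arguments are single-quoted so the shell passes them through untouched; curly quotes pasted from a forum page will break them.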
Numbers (llama-benchy, generation mode, 3 runs, single concurrency):
| test | t/s | peak t/s | e2e ttft (ms) |
|---|---|---|---|
| pp2048 | 1000 ± 73 | - | 2256 ± 194 |
| tg32 | 34.0 ± 3.0 | 35.1 | - |
| pp2048 @ d4096 | 1104 ± 3 | - | 5322 ± 120 |
| tg32 @ d4096 | 36.9 ± 1.4 | 38.1 | - |
| pp2048 @ d8192 | 1113 ± 45 | - | 8522 ± 269 |
| tg32 @ d8192 | 34.7 ± 1.9 | 35.8 | - |
| pp2048 @ d16384 | 809 ± 249 | - | 22636 ± 5933 |
| tg32 @ d16384 | 34.6 ± 2.2 | 35.7 | - |
| pp2048 @ d32768 | 1088 ± 4 | - | 28993 ± 146 |
| tg32 @ d32768 | 25.8 ± 10 | 32.1 | - |
| pp2048 @ d65536 | 996 ± 9 | - | 61282 ± 585 |
| tg32 @ d65536 | 30.9 ± 2.3 | 32.1 | - |
Prefill holds at ~1000–1113 t/s out to 64K, apart from one noisy d16384 run (809 ± 249), and token-gen sits at ~31–37 t/s across depths, apart from a noisy d32768 run (25.8 ± 10, peak 32.1). That matches ekkis. +1 to wolttam: 11 TPS isn’t where this lands; 30–40 single-stream (and ~65 peak at 8-way concurrency in
a separate sweep) is right. Cold boot is ~14 min on a wiped cache; I haven’t tried instanttensor yet.
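As a sanity check on the table, you can back out an implied prefill rate from the e2e TTFT column. This is a rough sketch and rests on an assumption I haven't verified against llama-benchy: that e2e TTFT covers prefilling the full depth plus the 2048-token prompt from scratch. `implied_prefill_tps` is a made-up helper name.

```shell
# Implied prefill rate in tokens/s: (depth + prompt tokens) / e2e TTFT.
# Arguments: context depth, prompt tokens, e2e TTFT in milliseconds.
implied_prefill_tps() {
  awk -v d="$1" -v p="$2" -v ms="$3" \
    'BEGIN { printf "%.0f\n", (d + p) * 1000 / ms }'
}

# Three rows from the table: no depth, d4096, d65536.
implied_prefill_tps 0     2048 2256
implied_prefill_tps 4096  2048 5322
implied_prefill_tps 65536 2048 61282
```

The three rows come out at roughly 900 to 1150 t/s, the same ballpark as the reported t/s column. They do not match exactly, so the accounting assumption is probably not quite how the tool measures, but it does line up with TTFT growing about linearly with depth, to around a minute at 64K.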