The confirmed state for DeepSeek-V4-Flash on DGX Spark has been hazyumps’ 2-node TP=2 recipe (~31–34 tok/s) and a couple of single-node experiments. One tweet from @TheAhmadOsman in early June said V4-Flash was running on his 4× Spark, but no recipe and no numbers were ever posted. I searched everywhere and found no published, verified 4× Spark recipe for V4-Flash with tok/s numbers before this thread.
It works. 49.4 tok/s single-stream on the canonical Python-function probe, peaking at 54.4 on the longer reasoning ones. With max-num-seqs tuned up (see below), concurrent throughput at n=8 hits 180 tok/s aggregate with peak generation 207 tok/s. Here’s exactly what I ran, and the three traps that burned me four hours before I figured out the actual fix was a single line in hazyumps’ own BUILD.md that I’d glossed over.
TL;DR
- 4× DGX Spark (GB10 / sm_121a), 128 GiB UMA per node, ConnectX-7 RoCE through MikroTik 200 GbE
- vLLM from jasl/vllm fork (PR #41834) + hazyumps’ sm12x_deep_gemm_fallbacks.py patch bind-mounted
- TP=4, EP enabled, MTP n=2, FP8 KV, 384K context, full+piecewise CUDA graphs
- Single-stream: 49.4 tok/s canonical, 54.4 peak on reasoning probes
- conc8: 180 tok/s aggregate, peak generation 207 tok/s (with max-num-seqs=8; see tuning section)
- The one fix that mattered: NCCL 2.30.4. 2.28.9 hard-wedges every long generation.
Hardware + software
4× DGX Spark (GB10, sm_121a), 128 GiB unified LPDDR5x per node. ConnectX-7 RoCE NICs (visible as rocep1s0f0 via ibv_devices). MikroTik CRS520 200 GbE switch.
Container built from jasl/vllm fork (codex/ds4-sm120-min-enable branch / PR #41834 line), vLLM tag v20260613.dev297+ga93b9098b, CUDA 13.0 sbsa base aarch64. Critical addition: I overlaid the container with libnccl2=2.30.4-1+cuda13.2 because the CUDA 13.0 base ships 2.28.9 by default. This was the whole thing.
Hazyumps’ sm12x_deep_gemm_fallbacks.py patch is bind-mounted over /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/deepseek_v4_ops/sm12x_deep_gemm_fallbacks.py — live-restartable, no rebuild needed to update.
Model is deepseek-ai/DeepSeek-V4-Flash (official FP8 weights), 149 GB, staged to /home/chris/models/DeepSeek-V4-Flash on each Spark. RoCE rendezvous on the 10.100.96.x range, master at port 29519, NCCL bootstrap over the same interface.
Cache mounts (/root/.cache/vllm and /root/.triton/cache) bind-mounted from host directories — required, not optional. See Trap 3.
The NCCL 2.30.4 overlay (the one that actually mattered)
Hazyumps’ BUILD.md says “NCCL 2.30.4 present in the image.” That’s the target — your build needs to actually pin it. CUDA 13.0 sbsa resolves libnccl2 to 2.28.9 unless you say otherwise. And you won’t see 2.30.4 in apt list until you run apt-get update to refresh the local cache — the build image ships with empty apt lists, so a casual check makes it look like 2.28.9 is all that’s available.
Overlay Dockerfile (~50 seconds to build):
FROM vllm-ds4-sm121:cu130
RUN apt-get update && \
    apt-get install -y --allow-downgrades --allow-change-held-packages \
        libnccl2=2.30.4-1+cuda13.2 \
        libnccl-dev=2.30.4-1+cuda13.2 && \
    rm -rf /var/lib/apt/lists/*
Build, tag, distribute to all 4 nodes (I used docker save | docker load over QSFP). After this, hazyumps’ LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libnccl.so.2.30.4 env var actually finds the lib. Without the overlay, that LD_PRELOAD fails with a warning that’s easy to miss in the cold-start log noise (“cannot be preloaded: ignored.”) and you have no idea you’re still running 2.28.9.
Verify after launch with docker logs vllm-ds4 2>&1 | grep "NCCL version" — should show NCCL version 2.30.4+cuda13.2.
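If you would rather script that check than eyeball it, here is a minimal sketch. It assumes you have already captured the container log as text (for example from docker logs vllm-ds4 2>&1); the function name check_nccl is made up for illustration. It looks for the version line and for the "cannot be preloaded" warning described above.

```python
import re

EXPECTED_NCCL = "2.30.4"  # the version pinned by the overlay in this post


def check_nccl(log_text: str, expected: str = EXPECTED_NCCL) -> dict:
    """Summarize NCCL evidence found in container log text (hypothetical helper)."""
    # Matches lines such as "NCCL version 2.30.4+cuda13.2".
    versions = sorted(set(re.findall(r"NCCL version (\d+\.\d+\.\d+)", log_text)))
    # The loader warning quoted in the post when the LD_PRELOAD target is missing.
    preload_ignored = "cannot be preloaded" in log_text
    ok = versions == [expected] and not preload_ignored
    return {"versions": versions, "preload_ignored": preload_ignored, "ok": ok}
```

An empty version list means NCCL has not initialized yet (or the log was truncated), so the sketch treats that as not ok rather than guessing.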
The launch
Same docker invocation across all 4 nodes — only --node-rank and VLLM_HOST_IP differ, workers add --headless. I’ll describe the load-bearing pieces in prose; the full launch script is ~90 lines and mostly mirrors hazyumps’ 2-node template.
RDMA wiring. Three things: NCCL env vars for the RoCE HCA (NCCL_IB_HCA=rocep1s0f0, NCCL_IB_DISABLE=0), matching NCCL_SOCKET_IFNAME / GLOO_SOCKET_IFNAME for the same interface, and on the docker command --device=/dev/infiniband plus --cap-add=IPC_LOCK plus --ulimit memlock=-1:-1. The LD_PRELOAD to libnccl.so.2.30.4 pins the NCCL version (only works after the overlay above).
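The RDMA pieces above can be collected in one place. This is a sketch rather than the actual launch script: rdma_docker_args is a hypothetical helper, and the socket interface name is left as a parameter because the post does not give the netdev name (rocep1s0f0 is the RDMA device, not necessarily the interface name).

```python
def rdma_docker_args(
    ifname: str,
    hca: str = "rocep1s0f0",
    nccl_lib: str = "/usr/lib/aarch64-linux-gnu/libnccl.so.2.30.4",
) -> list[str]:
    """Build the RDMA-related part of a docker run command (hypothetical helper)."""
    env = {
        "NCCL_IB_HCA": hca,            # RoCE HCA as listed by ibv_devices
        "NCCL_IB_DISABLE": "0",        # keep the IB/RoCE transport enabled
        "NCCL_SOCKET_IFNAME": ifname,  # bootstrap over the same interface
        "GLOO_SOCKET_IFNAME": ifname,
        "LD_PRELOAD": nccl_lib,        # only resolves after the NCCL overlay
    }
    args = [
        "--device=/dev/infiniband",
        "--cap-add=IPC_LOCK",
        "--ulimit", "memlock=-1:-1",
    ]
    for key, value in env.items():
        args += ["-e", f"{key}={value}"]
    return args
```

Keeping the env vars in a dict makes it harder to set NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME to different interfaces by accident.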
Memory and compile. PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True (reduces unified-memory fragmentation on the big KV allocs), TORCH_CUDA_ARCH_LIST=12.1a, VLLM_TRITON_MLA_SPARSE=1, VLLM_ALLOW_LONG_MAX_MODEL_LEN=1.
Mounts. Model directory, the sm12x_deep_gemm_fallbacks.py patch bind-mounted over the in-image path, and the two cache directories. Make the cache directories on the host before docker run.
vLLM flags worth calling out. TP=4, PP=1, nnodes=4, --enable-expert-parallel, --distributed-executor-backend mp, --master-addr pointed at the head’s RoCE IP on port 29519. KV cache: --kv-cache-dtype fp8, --block-size 256, --enable-prefix-caching. Context: --max-model-len 393216 (model max is 1M; I kept hazyumps’ value). Compute: --max-num-seqs 8 (was 2 in the conservative baseline, see tuning), --max-num-batched-tokens 4096, --gpu-memory-utilization 0.80. CUDA graphs full+piecewise via --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'. MTP via --speculative-config '{"method":"deepseek_mtp","num_speculative_tokens":2}'. Parsers: --reasoning-parser deepseek_v4 and --tool-call-parser deepseek_v4 plus --enable-auto-tool-choice. Standard --load-format safetensors, --no-enable-flashinfer-autotune (saves a 10+ minute startup autotune), --trust-remote-code, --tokenizer-mode deepseek_v4.
One gotcha worth repeating before you copy-paste hazyumps’ start scripts: the jasl/vllm image’s ENTRYPOINT is [vllm serve]. If you keep the "$IMAGE" vllm serve "$MODEL" invocation verbatim, you’ll double-stack and get vllm: error: unrecognized arguments: serve /models/DeepSeek-V4-Flash. Drop the leading vllm serve from your docker command. Workers add --headless. Head adds --host 0.0.0.0 --port 8000.
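To make the head/worker split concrete, here is a sketch that assembles the container arguments for one node from the values listed above. Treat the flag spellings --tensor-parallel-size, --pipeline-parallel-size, --nnodes and --master-port as assumptions: the post only gives them as TP=4, PP=1, nnodes=4 and "port 29519". The vllm_args name and the HEAD_ROCE_IP placeholder are also made up for illustration.

```python
import json


def vllm_args(model_path: str, node_rank: int, master_addr: str) -> list[str]:
    """Arguments passed after the image name for one node (hypothetical helper)."""
    args = [
        model_path,  # no leading "vllm serve": the image ENTRYPOINT supplies it
        "--tensor-parallel-size", "4",   # assumed spelling of TP=4
        "--pipeline-parallel-size", "1", # assumed spelling of PP=1
        "--nnodes", "4",                 # assumed spelling of nnodes=4
        "--node-rank", str(node_rank),
        "--master-addr", master_addr,
        "--master-port", "29519",        # assumed spelling
        "--enable-expert-parallel",
        "--distributed-executor-backend", "mp",
        "--kv-cache-dtype", "fp8",
        "--block-size", "256",
        "--enable-prefix-caching",
        "--max-model-len", "393216",
        "--max-num-seqs", "8",
        "--max-num-batched-tokens", "4096",
        "--gpu-memory-utilization", "0.80",
        "--compilation-config",
        json.dumps({"cudagraph_mode": "FULL_AND_PIECEWISE", "custom_ops": ["all"]}),
        "--speculative-config",
        json.dumps({"method": "deepseek_mtp", "num_speculative_tokens": 2}),
        "--reasoning-parser", "deepseek_v4",
        "--tool-call-parser", "deepseek_v4",
        "--enable-auto-tool-choice",
        "--load-format", "safetensors",
        "--no-enable-flashinfer-autotune",
        "--trust-remote-code",
        "--tokenizer-mode", "deepseek_v4",
    ]
    if node_rank == 0:
        args += ["--host", "0.0.0.0", "--port", "8000"]  # head serves the API
    else:
        args.append("--headless")                         # workers have no API server
    return args
```

Usage would be along the lines of vllm_args("/models/DeepSeek-V4-Flash", 0, "HEAD_ROCE_IP") on the head and the same call with ranks 1 to 3 on the workers.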
Cold boot is ~3 minutes from docker run to “Application startup complete.” First inference call takes ~30 seconds extra while Triton JITs compile shapes the warmup didn’t cover; subsequent calls are hot.
Results
Locked probe battery at temperature 0.3. Single-stream means no other requests were in flight. All probes ran after the first warmup call had primed the kernel cache.
| Probe | tokens | wall | tok/s | finish |
|---|---|---|---|---|
| c1 spec→code (JSONL filter) | 897 | 18.15s | 49.4 | stop |
| c2 multi-file project | 2,339 | 45.44s | 51.5 | stop |
| c3 bugfix (threading) | 504 | 9.35s | 53.9 | stop |
| c4 refactor | 378 | 7.72s | 49.0 | stop |
| o1 customer email | 398 | 9.32s | 42.7 | stop |
| o3 JSON extraction | 522 | 9.91s | 52.7 | stop |
| l1 probability | 1,195 | 21.97s | 54.4 | stop |
| l2 logic puzzle | 2,048 | 37.63s | 54.4 | length |
| t1 tool calling | 90 | 7.91s | 11.4 | tool_calls |
| conc4 (max-num-seqs=8) | 1,142 total | 16.13s | 70.8 agg | all stop |
| conc8 (max-num-seqs=8) | 4,096 total | 22.77s | 179.9 agg (peak gen 207) | all length |
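For anyone reproducing the table: the tok/s column is generated tokens divided by wall-clock seconds, with the concurrent rows using the total tokens across all streams. A small sketch that recomputes a few rows from the numbers above (the tok_per_s helper name is just for illustration):

```python
# (probe, generated tokens, wall seconds) copied from the results table.
PROBES = [
    ("c1 spec->code", 897, 18.15),
    ("l1 probability", 1195, 21.97),
    ("t1 tool calling", 90, 7.91),
    ("conc4 total", 1142, 16.13),
    ("conc8 total", 4096, 22.77),
]


def tok_per_s(tokens: int, wall_s: float) -> float:
    """Throughput as generated tokens per wall-clock second."""
    return tokens / wall_s


for name, tokens, wall_s in PROBES:
    print(f"{name:18s} {tok_per_s(tokens, wall_s):6.1f} tok/s")
```

Note that wall time here includes time to first token, which is part of why the short t1 probe scores so low.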
Cold sanity probe (cache empty, JIT during inference) ran at 33.5 tok/s. After the first request warmed the cache into the mounted volume, every single-stream call lives at 49–54.
MTP draft acceptance rate ran 0.69–0.77 mean across probes. Hazyumps reports closer to 0.80 on 2-Spark — possibly something I’m missing in draft-model warmup; haven’t chased it.
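To put the acceptance gap in perspective, here is a back-of-the-envelope sketch. It assumes the reported rate means accepted draft tokens divided by proposed draft tokens, and that every decode step proposes exactly num_speculative_tokens drafts; if the metric is defined differently, the numbers do not apply.

```python
def tokens_per_step(acceptance_rate: float, num_speculative_tokens: int = 2) -> float:
    """Mean tokens emitted per decode step under the assumptions stated above.

    Each step yields one token from the target model plus the accepted drafts.
    """
    return 1.0 + acceptance_rate * num_speculative_tokens


for rate in (0.69, 0.77, 0.80):
    print(f"acceptance {rate:.2f} -> {tokens_per_step(rate):.2f} tokens/step")
```

Under those assumptions the 0.69–0.77 range works out to roughly 2.4 to 2.5 tokens per step versus 2.6 at 0.80, so closing the gap would be a single-digit percentage gain rather than a step change.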
The t1 tool-call rate (11.4 tok/s) is ~1/5 of text rate. Same pattern I’ve seen on GLM-4.7 and Qwen3.5-397B on this cluster — structured output doesn’t get the spec-decode benefit. Expected, not a problem.
The conc4/conc8 numbers above are after tuning max-num-seqs=8. The baseline value from hazyumps’ 2-Spark recipe is 2, and at that setting conc4 was 37 agg, conc8 was 84 agg. The tuning lift is real and substantial — see below.
Tuning: max-num-seqs lift
Hazyumps validated max-num-seqs=2 on 2-Spark. With 4 nodes there’s headroom; I tried 4 and 8 single-variable.
| Setting | conc4 agg | conc8 agg | peak gen |
|---|---|---|---|
| baseline (2) | 37.2 | 84.3 | 87.5 |
| max-num-seqs=4 | 62.3 | 109.9 | 138.8 |
| max-num-seqs=8 | 70.8 | 179.9 | 206.7 |
conc8 head log during the =8 run showed Running: 8 reqs, Waiting: 0 reqs — vLLM was actually running all 8 in parallel, no queue. 4× DGX Spark with TP=4 + MTP + RDMA has comfortable concurrency headroom on this model.
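The lift is easier to read as ratios against the baseline. A quick sketch using only the numbers from the tuning table (lift and per_stream are illustrative helper names):

```python
# Aggregate tok/s from the tuning table, keyed by max-num-seqs.
RESULTS = {
    2: {"conc4": 37.2, "conc8": 84.3, "peak": 87.5},
    4: {"conc4": 62.3, "conc8": 109.9, "peak": 138.8},
    8: {"conc4": 70.8, "conc8": 179.9, "peak": 206.7},
}


def lift(setting: int, metric: str, baseline: int = 2) -> float:
    """Ratio of a setting's throughput to the baseline setting's throughput."""
    return RESULTS[setting][metric] / RESULTS[baseline][metric]


def per_stream(setting: int, metric: str, streams: int) -> float:
    """Average tok/s seen by each concurrent request."""
    return RESULTS[setting][metric] / streams


for setting in (4, 8):
    print(f"max-num-seqs={setting}: conc4 x{lift(setting, 'conc4'):.2f}, "
          f"conc8 x{lift(setting, 'conc8'):.2f}")
print(f"per-stream at conc8, max-num-seqs=8: {per_stream(8, 'conc8', 8):.1f} tok/s")
```

The per-stream figure is the trade-off to keep in mind: aggregate throughput rises with concurrency, but each individual request runs well below the single-stream rate.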
Traps that cost me time
Trap 1: NCCL 2.28.9 with TP=4 wedges every long generation
Headline trap. Symptoms: sanity probe with short generation works clean at ~33 tok/s. First long generation (anything producing >500 tokens of reasoning trace) decays generation throughput to 0.0 after ~30 seconds. Engine logs show shm_broadcast: No available shared memory broadcast block found in 60 seconds markers spaced 60s apart, then TimeoutError: RPC call to sample_tokens timed out, then EngineDeadError. Container restarts. Workers don’t — they zombie with broken-pipe loops to the dead head’s TCPStore. Have to kill+restart all 4 nodes by hand.
I burned three full restart cycles before catching this. The symptom looks like a JIT compile hang — and there ARE Triton compiles happening during inference (warmup doesn’t cover all shapes). But that’s the visible symptom, not the cause. 2.30.4’s collective primitives synchronize TP workers across the compile-latency divergence; 2.28.9 doesn’t. With 2.30.4 + LD_PRELOAD in place, the same JIT compiles still happen during inference, but they complete in ~1 second per kernel, workers stay in sync, the request finishes cleanly.
Hazyumps’ TUNING.md literally says “The NCCL 2.30.4 upgrade eliminates the original hard wedges.” They were right. I just didn’t believe it was load-bearing until I’d ruled out everything else.
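Since the wedge looks like a slow compile until the engine dies, it can help to scan the head log for the three markers quoted above instead of waiting it out. A minimal sketch, matching on those exact substrings; the helper names are made up and this is not part of hazyumps’ harness.

```python
# Substrings quoted from the head log during the 2.28.9 wedge.
WEDGE_MARKERS = (
    "No available shared memory broadcast block found",
    "RPC call to sample_tokens timed out",
    "EngineDeadError",
)


def wedge_markers_seen(log_lines) -> list[str]:
    """Return the wedge markers that appear anywhere in the given log lines."""
    lines = list(log_lines)  # allow generators; we scan once per marker
    return [m for m in WEDGE_MARKERS if any(m in line for line in lines)]


def looks_wedged(log_lines) -> bool:
    """True if at least one wedge marker is present."""
    return bool(wedge_markers_seen(log_lines))
```

The first marker alone is a warning sign but can also show up during a recoverable stall; seeing all three is the point at which every node needs a manual restart.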
Trap 2: The container ENTRYPOINT doubles your command
The jasl/vllm image’s ENTRYPOINT is [vllm serve]. If you copy hazyumps’ start scripts verbatim, your "$IMAGE" vllm serve "$MODEL" becomes vllm serve vllm serve $MODEL and you get vllm: error: unrecognized arguments: serve /models/DeepSeek-V4-Flash.
Fix: drop the leading vllm serve from the docker command. Just pass "$IMAGE" "$MODEL" $FLAGS. About ten minutes from first launch to noticing. Easy to catch if you grep the log for the error.
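If your launcher builds the argument list programmatically, a small guard avoids the double-up regardless of which script the arguments came from. A sketch, assuming the image ENTRYPOINT is exactly the two words vllm serve as described above; strip_entrypoint is a hypothetical name.

```python
ENTRYPOINT = ("vllm", "serve")  # what the image already runs, per this post


def strip_entrypoint(args: list[str], entrypoint=ENTRYPOINT) -> list[str]:
    """Drop a leading copy of the entrypoint from container args, if present."""
    n = len(entrypoint)
    if tuple(args[:n]) == tuple(entrypoint):
        return list(args[n:])
    return list(args)
```

It only removes one leading copy and never touches arguments further along, so a model path or flag that happens to contain "serve" is left alone.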
Trap 3: Cache mounts are required, not optional
Hazyumps mentions this in passing in TUNING.md: “triton-cache + vllm-cache volume mounts persist compiled kernels across restarts (else a recompile ‘hang’ every boot).”
I read “every boot” and thought “well I’m not rebooting much, I’ll add them later.” Wrong. Even without rebooting, the JIT compile on the first inference call after a container restart causes the exact same wedge as Trap 1 if your NCCL isn’t 2.30.4. And once you ARE on 2.30.4, the cache mounts are how you make sure subsequent restarts don’t repeat the cold-cache hit. Make the host directories first, then mount them at the standard paths inside the container.
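A sketch of the "make the directories first, then mount" step. The container paths are the ones given earlier in the post; the host layout (a vllm-cache and a triton-cache directory under one root) and the cache_mount_args name are assumptions for illustration.

```python
import os

# Container-side cache paths from the post, keyed by a hypothetical host dir name.
CONTAINER_CACHE_PATHS = {
    "vllm-cache": "/root/.cache/vllm",
    "triton-cache": "/root/.triton/cache",
}


def cache_mount_args(host_root: str) -> list[str]:
    """Create the host cache directories and return matching docker -v flags."""
    args = []
    for name, container_path in CONTAINER_CACHE_PATHS.items():
        host_path = os.path.join(host_root, name)
        # Create up front so the directory exists with your ownership
        # before docker run ever sees it.
        os.makedirs(host_path, exist_ok=True)
        args += ["-v", f"{host_path}:{container_path}"]
    return args
```

Calling it is idempotent, so it can run at the top of the launch script on every node without special-casing first boot.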
Bonus: --runtime nvidia if your docker doesn’t have it installed
Hazyumps’ scripts use --runtime nvidia --gpus all. If your Spark’s docker doesn’t have the nvidia runtime registered, you get Error response from daemon: unknown or invalid runtime name: nvidia in 90-point bold. --gpus all alone works on my Sparks. Easy fix, caught immediately.
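One way to avoid hard-coding the choice is to pick the flags from what the daemon reports. This sketch assumes you feed it the output of docker info --format '{{json .Runtimes}}', which should be a JSON object keyed by runtime name; gpu_flags is a made-up helper, and whether --gpus all works alone depends on your container toolkit setup.

```python
import json


def gpu_flags(runtimes_json: str) -> list[str]:
    """Choose docker GPU flags based on the runtimes the daemon has registered."""
    runtimes = json.loads(runtimes_json)
    if "nvidia" in runtimes:
        # Matches hazyumps' scripts when the nvidia runtime is registered.
        return ["--runtime", "nvidia", "--gpus", "all"]
    # Fallback that worked on the Sparks in this post.
    return ["--gpus", "all"]
```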
What’s still on the table
- max-num-batched-tokens past 4096. vLLM literally warned me about this: the speculative-decoding setting under-saturates the scheduler at 4096.
- max-model-len past 393216. Model’s max is 1M; I kept it at 384K to match hazyumps. The cluster has the unified-memory headroom for more.
- MTP acceptance rate tuning. Hot cache gives 0.69–0.77; hazyumps reports ~0.80 on 2-Spark. Possibly a draft-model warmup or compilation-shape thing.
Credit
This recipe is almost entirely hazyumps’ work scaled up by one factor. Specifically:
- hazyumps/deepseek-v4-flash-gb10: 2× Spark recipe, BUILD.md, TUNING.md, NETWORK.md, and the sm12x_deep_gemm_fallbacks.py patch. Everything load-bearing came from this. The “wedge vs. recoverable stall” calibration in TUNING.md saved me at least two rounds of unnecessary debugging.
- jasl/vllm fork (PR #41834 / codex/ds4-sm120-min-enable): the sm_12x DeepSeek-V4 substrate. Without this, upstream vLLM crashes on load on GB10.
- @TheAhmadOsman’s tweet in early June saying V4-Flash was running on his 4× Spark: first public claim that 4-node was possible. No recipe came with it, but it’s the reason I tried in the first place.
If you’re in this space and haven’t read hazyumps’ TUNING.md, do that before this post.
Sources
- hazyumps/deepseek-v4-flash-gb10 (GitHub): Serve DeepSeek-V4-Flash on 2x NVIDIA GB10 / DGX Spark (sm_121) with vLLM. sm_121 indexer patch (bf16 + fused tf32 Triton top-k), tuned dual-Spark config (TP=2+EP, NCCL 2.30.4/RDMA, 384K, MTP), runbook + verify harness.
- jasl/vllm fork (PR #41834): "[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes" by jasl, Pull Request #41834, vllm-project/vllm (GitHub)
- jasl/vllm-ds4-sm120-harness: https://github.com/jasl/vllm-ds4-sm120-harness
- @TheAhmadOsman tweet: Ahmad on X: "DeepSeek V4 Flash is now running on 4x DGX Spark / GB10 cluster Had to patch several things in vLLM to get it up w/ PyTorch fallbacks Targeted kernel optimization is next up P.S. Codex Cli w/ GPT-5.5 XHIGH handled the whole thing on its own, now we optimize those GB10 kernels https://t.co/0zX8MeCC2i"
Happy to answer questions.