DeepSeek-V4-Flash on 4× DGX Spark via vLLM (jasl fork, TP=4, RDMA, MTP) — 49–54 tok/s single-stream, full recipe + the traps

The confirmed state for DeepSeek-V4-Flash on DGX Spark has been hazyumps’ 2-node TP=2 recipe (~31–34 tok/s) and a couple of single-node experiments. One tweet from @TheAhmadOsman in early June said V4-Flash was running on his 4× Spark, but no recipe and no numbers were ever posted. I searched everywhere and found no published, verified 4× Spark recipe for V4-Flash with tok/s numbers before this thread.

It works. 49.4 tok/s single-stream on the canonical Python-function probe, peaking at 54.4 on the longer reasoning ones. With max-num-seqs tuned up (see below), concurrent throughput at n=8 hits 180 tok/s aggregate with peak generation 207 tok/s. Here’s exactly what I ran, and the three traps that burned me four hours before I figured out the actual fix was a single line in hazyumps’ own BUILD.md that I’d glossed over.

TL;DR

  • 4× DGX Spark (GB10 / sm_121a), 128 GiB UMA per node, ConnectX-7 RoCE through MikroTik 200 GbE
  • vLLM from jasl/vllm fork (PR #41834) + hazyumps’ sm12x_deep_gemm_fallbacks.py patch bind-mounted
  • TP=4, EP enabled, MTP n=2, FP8 KV, 384K context, full+piecewise CUDA graphs
  • Single-stream: 49.4 tok/s canonical, 54.4 peak on reasoning probes
  • conc8: 180 tok/s aggregate, peak generation 207 tok/s (with max-num-seqs=8 — see tuning section)
  • The one fix that mattered: NCCL 2.30.4. 2.28.9 hard-wedges every long generation.

Hardware + software

4× DGX Spark (GB10, sm_121a), 128 GiB unified LPDDR5x per node. ConnectX-7 RoCE NICs (visible as rocep1s0f0 via ibv_devices). MikroTik CRS520 200 GbE switch.

Container built from the jasl/vllm fork (codex/ds4-sm120-min-enable branch / PR #41834 line), vLLM tag v20260613.dev297+ga93b9098b, on the CUDA 13.0 sbsa (aarch64) base. Critical addition: I overlaid the container with libnccl2=2.30.4-1+cuda13.2, because the CUDA 13.0 base ships 2.28.9 by default. This was the whole thing.

Hazyumps’ sm12x_deep_gemm_fallbacks.py patch is bind-mounted over /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/deepseek_v4_ops/sm12x_deep_gemm_fallbacks.py — live-restartable, no rebuild needed to update.

Model is deepseek-ai/DeepSeek-V4-Flash (official FP8 weights), 149 GB, staged to /home/chris/models/DeepSeek-V4-Flash on each Spark. RoCE rendezvous on the 10.100.96.x range, master at port 29519, NCCL bootstrap over the same interface.

Cache mounts (/root/.cache/vllm and /root/.triton/cache) bind-mounted from host directories — required, not optional. See Trap 3.

The NCCL 2.30.4 overlay (the one that actually mattered)

Hazyumps’ BUILD.md says “NCCL 2.30.4 present in the image.” That’s the target — your build needs to actually pin it. CUDA 13.0 sbsa resolves libnccl2 to 2.28.9 unless you say otherwise. And you won’t see 2.30.4 in apt list until you run apt-get update to refresh the local cache — the build image ships with empty apt lists, so a casual check makes it look like 2.28.9 is all that’s available.

Overlay Dockerfile (~50 seconds to build):

FROM vllm-ds4-sm121:cu130

RUN apt-get update && \
    apt-get install -y --allow-downgrades --allow-change-held-packages \
      libnccl2=2.30.4-1+cuda13.2 \
      libnccl-dev=2.30.4-1+cuda13.2 && \
    rm -rf /var/lib/apt/lists/*

Build, tag, distribute to all 4 nodes (I used docker save | docker load over QSFP). After this, hazyumps’ LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libnccl.so.2.30.4 env var actually finds the lib. Without the overlay, that LD_PRELOAD fails with a warning that’s easy to miss in the cold-start log noise (“cannot be preloaded: ignored.”) and you have no idea you’re still running 2.28.9.

Verify after launch with docker logs vllm-ds4 2>&1 | grep "NCCL version" — should show NCCL version 2.30.4+cuda13.2.
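A minimal sketch of that check in Python, assuming you first capture the container log to a string (for example the output of docker logs vllm-ds4 2>&1). The function name check_nccl_log is hypothetical. The two strings it looks for are the ones quoted in this post, so adjust them if your log wording differs.

```python
import re

EXPECTED_NCCL = "2.30.4"  # the version this post pins via the overlay


def check_nccl_log(log_text, expected=EXPECTED_NCCL):
    """Return (ok, problems) for a captured container log."""
    problems = []
    # The dynamic loader prints this when the LD_PRELOAD target is missing.
    if "cannot be preloaded" in log_text:
        problems.append("LD_PRELOAD was ignored; pinned libnccl not found")
    # Collect every reported NCCL version, e.g. "NCCL version 2.30.4+cuda13.2".
    versions = re.findall(r"NCCL version (\d+\.\d+\.\d+)", log_text)
    if not versions:
        problems.append("no 'NCCL version' line found yet")
    elif any(v != expected for v in versions):
        problems.append(f"unexpected NCCL version(s): {sorted(set(versions))}")
    return (not problems, problems)
```

Run it against the log of every node, not just the head: a worker that silently fell back to 2.28.9 is enough to reproduce the wedge described in Trap 1.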

The launch

Same docker invocation across all 4 nodes — only --node-rank and VLLM_HOST_IP differ, workers add --headless. I’ll describe the load-bearing pieces in prose; the full launch script is ~90 lines and mostly mirrors hazyumps’ 2-node template.

RDMA wiring. Three things: NCCL env vars for the RoCE HCA (NCCL_IB_HCA=rocep1s0f0, NCCL_IB_DISABLE=0), matching NCCL_SOCKET_IFNAME / GLOO_SOCKET_IFNAME for the same interface, and on the docker command --device=/dev/infiniband plus --cap-add=IPC_LOCK plus --ulimit memlock=-1:-1. The LD_PRELOAD to libnccl.so.2.30.4 pins the NCCL version (only works after the overlay above).

Memory and compile. PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True (reduces unified-memory fragmentation on the big KV allocs), TORCH_CUDA_ARCH_LIST=12.1a, VLLM_TRITON_MLA_SPARSE=1, VLLM_ALLOW_LONG_MAX_MODEL_LEN=1.

Mounts. Model directory, the sm12x_deep_gemm_fallbacks.py patch bind-mounted over the in-image path, and the two cache directories. Make the cache directories on the host before docker run.
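The three paragraphs above can be sketched as a small Python helper that assembles the docker run prefix as an argv list. This is an illustration, not my actual launch script: the function name, the ROCE_IFNAME placeholder, the --network host choice, and the -d / --name layout are assumptions. The env vars, device flags, and in-container paths are the ones named in this post.

```python
ROCE_HCA = "rocep1s0f0"    # from the post (ibv_devices)
ROCE_IFNAME = "CHANGE_ME"  # ASSUMPTION: netdev name of the RoCE port, not given in the post
NCCL_LIB = "/usr/lib/aarch64-linux-gnu/libnccl.so.2.30.4"  # from the post
PATCH_IN_IMAGE = (
    "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/"
    "deepseek_v4_ops/sm12x_deep_gemm_fallbacks.py"
)  # from the post


def docker_run_prefix(image, host_ip, model_dir, patch_file, vllm_cache, triton_cache):
    """Return the `docker run ... IMAGE` part of the command as an argv list."""
    env = {
        "VLLM_HOST_IP": host_ip,  # differs per node
        "NCCL_IB_HCA": ROCE_HCA,
        "NCCL_IB_DISABLE": "0",
        "NCCL_SOCKET_IFNAME": ROCE_IFNAME,
        "GLOO_SOCKET_IFNAME": ROCE_IFNAME,
        "LD_PRELOAD": NCCL_LIB,  # only effective after the NCCL overlay
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
        "TORCH_CUDA_ARCH_LIST": "12.1a",
        "VLLM_TRITON_MLA_SPARSE": "1",
        "VLLM_ALLOW_LONG_MAX_MODEL_LEN": "1",
    }
    mounts = [
        # Container model path as it appears in the error message quoted in this post.
        (model_dir, "/models/DeepSeek-V4-Flash"),
        (patch_file, PATCH_IN_IMAGE),
        # Create both cache directories on the host before docker run.
        (vllm_cache, "/root/.cache/vllm"),
        (triton_cache, "/root/.triton/cache"),
    ]
    argv = [
        "docker", "run", "-d", "--name", "vllm-ds4",
        "--network", "host",  # ASSUMPTION: host networking for the RoCE interface
        "--gpus", "all",
        "--device=/dev/infiniband",
        "--cap-add=IPC_LOCK",
        "--ulimit", "memlock=-1:-1",
    ]
    for key, value in env.items():
        argv += ["-e", f"{key}={value}"]
    for src, dst in mounts:
        argv += ["-v", f"{src}:{dst}"]
    argv.append(image)
    return argv
```

Building an argv list rather than one long shell string sidesteps quoting problems once the JSON-valued vLLM flags get appended after the image name.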

vLLM flags worth calling out:

  • Parallelism: TP=4, PP=1, nnodes=4, --enable-expert-parallel, --distributed-executor-backend mp, --master-addr pointed at the head’s RoCE IP on port 29519.
  • KV cache: --kv-cache-dtype fp8, --block-size 256, --enable-prefix-caching.
  • Context: --max-model-len 393216 (model max is 1M; I kept hazyumps’ value).
  • Compute: --max-num-seqs 8 (was 2 in the conservative baseline, see tuning), --max-num-batched-tokens 4096, --gpu-memory-utilization 0.80.
  • CUDA graphs: full+piecewise via --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'.
  • MTP: --speculative-config '{"method":"deepseek_mtp","num_speculative_tokens":2}'.
  • Parsers: --reasoning-parser deepseek_v4 and --tool-call-parser deepseek_v4 plus --enable-auto-tool-choice.
  • Standard: --load-format safetensors, --no-enable-flashinfer-autotune (saves a 10+ minute startup autotune), --trust-remote-code, --tokenizer-mode deepseek_v4.

One gotcha worth repeating before you copy-paste hazyumps’ start scripts: the jasl/vllm image’s ENTRYPOINT is [vllm serve]. If you keep their "$IMAGE" vllm serve "$MODEL" invocation verbatim, you’ll double-stack and get vllm: error: unrecognized arguments: serve /models/DeepSeek-V4-Flash. Drop the leading vllm serve from your docker command. Workers add --headless. Head adds --host 0.0.0.0 --port 8000.
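A sketch of the per-node argument list in Python, under the same caveats: the function name is hypothetical, and the long-form spellings --tensor-parallel-size, --pipeline-parallel-size, --nnodes and --master-port are my assumption of how TP=4, PP=1, nnodes=4 and the port map onto the vLLM CLI in this fork. Everything else is the flag list from this section.

```python
import json

MODEL_IN_CONTAINER = "/models/DeepSeek-V4-Flash"  # path seen in the error message above


def vllm_args(node_rank, head_ip, nnodes=4):
    """Arguments placed after the image name; the ENTRYPOINT supplies `vllm serve`."""
    args = [
        MODEL_IN_CONTAINER,  # no leading "vllm serve" here
        "--tensor-parallel-size", "4",
        "--pipeline-parallel-size", "1",
        "--nnodes", str(nnodes),
        "--node-rank", str(node_rank),
        "--master-addr", head_ip,
        "--master-port", "29519",
        "--enable-expert-parallel",
        "--distributed-executor-backend", "mp",
        "--kv-cache-dtype", "fp8",
        "--block-size", "256",
        "--enable-prefix-caching",
        "--max-model-len", "393216",
        "--max-num-seqs", "8",
        "--max-num-batched-tokens", "4096",
        "--gpu-memory-utilization", "0.80",
        # json.dumps keeps the JSON-valued flags as single, correctly quoted arguments.
        "--compilation-config",
        json.dumps({"cudagraph_mode": "FULL_AND_PIECEWISE", "custom_ops": ["all"]}),
        "--speculative-config",
        json.dumps({"method": "deepseek_mtp", "num_speculative_tokens": 2}),
        "--reasoning-parser", "deepseek_v4",
        "--tool-call-parser", "deepseek_v4",
        "--enable-auto-tool-choice",
        "--load-format", "safetensors",
        "--no-enable-flashinfer-autotune",
        "--trust-remote-code",
        "--tokenizer-mode", "deepseek_v4",
    ]
    if node_rank == 0:
        args += ["--host", "0.0.0.0", "--port", "8000"]  # head serves the API
    else:
        args.append("--headless")  # workers run without an API server
    return args
```

For example, vllm_args(0, "10.100.96.1") gives the head's arguments and vllm_args(2, "10.100.96.1") a worker's; the IP is a made-up address in the 10.100.96.x range.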

Cold boot is ~3 minutes from docker run to “Application startup complete.” First inference call takes ~30 seconds extra while Triton JITs compile shapes the warmup didn’t cover; subsequent calls are hot.

Results

Locked probe battery at temperature 0.3. Single-stream means the API was otherwise quiescent. All probes ran after the first warmup call had primed the kernel cache.

| Probe                       | Tokens      | Wall   | tok/s                    | Finish     |
|-----------------------------|-------------|--------|--------------------------|------------|
| c1 spec→code (JSONL filter) | 897         | 18.15s | 49.4                     | stop       |
| c2 multi-file project       | 2,339       | 45.44s | 51.5                     | stop       |
| c3 bugfix (threading)       | 504         | 9.35s  | 53.9                     | stop       |
| c4 refactor                 | 378         | 7.72s  | 49.0                     | stop       |
| o1 customer email           | 398         | 9.32s  | 42.7                     | stop       |
| o3 JSON extraction          | 522         | 9.91s  | 52.7                     | stop       |
| l1 probability              | 1,195       | 21.97s | 54.4                     | stop       |
| l2 logic puzzle             | 2,048       | 37.63s | 54.4                     | length     |
| t1 tool calling             | 90          | 7.91s  | 11.4                     | tool_calls |
| conc4 (max-num-seqs=8)      | 1,142 total | 16.13s | 70.8 agg                 | all stop   |
| conc8 (max-num-seqs=8)      | 4,096 total | 22.77s | 179.9 agg (peak gen 207) | all length |
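The tok/s column is simply generated tokens divided by wall time, and for the conc rows the token count is the total across all streams, which is why those are aggregate figures. A short Python check that reproduces the column from the other two (the names ROWS, tok_per_s and RATES are just for illustration):

```python
# (probe, generated tokens, wall seconds) copied from the results table
ROWS = [
    ("c1", 897, 18.15),
    ("c2", 2339, 45.44),
    ("c3", 504, 9.35),
    ("c4", 378, 7.72),
    ("o1", 398, 9.32),
    ("o3", 522, 9.91),
    ("l1", 1195, 21.97),
    ("l2", 2048, 37.63),
    ("t1", 90, 7.91),
    ("conc4", 1142, 16.13),  # tokens summed over 4 streams
    ("conc8", 4096, 22.77),  # tokens summed over 8 streams
]


def tok_per_s(tokens, wall_s):
    """Throughput rounded to one decimal, as reported in the table."""
    return round(tokens / wall_s, 1)


RATES = {name: tok_per_s(tokens, wall) for name, tokens, wall in ROWS}
```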

Cold sanity probe (cache empty, JIT during inference) ran at 33.5 tok/s. After the first request warmed the cache into the mounted volume, every single-stream call lives at 49–54.

MTP draft acceptance rate ran 0.69–0.77 mean across probes. Hazyumps reports closer to 0.80 on 2-Spark — possibly something I’m missing in draft-model warmup; haven’t chased it.

The t1 tool-call rate (11.4 tok/s) is ~1/5 of text rate. Same pattern I’ve seen on GLM-4.7 and Qwen3.5-397B on this cluster — structured output doesn’t get the spec-decode benefit. Expected, not a problem.

The conc4/conc8 numbers above are after tuning max-num-seqs=8. The baseline value from hazyumps’ 2-Spark recipe is 2, and at that setting conc4 was 37 agg, conc8 was 84 agg. The tuning lift is real and substantial — see below.

Tuning: max-num-seqs lift

Hazyumps validated max-num-seqs=2 on 2-Spark. With 4 nodes there’s headroom; I tried 4 and 8, changing only that one variable.

| Setting        | conc4 agg | conc8 agg | Peak gen |
|----------------|-----------|-----------|----------|
| baseline (2)   | 37.2      | 84.3      | 87.5     |
| max-num-seqs=4 | 62.3      | 109.9     | 138.8    |
| max-num-seqs=8 | 70.8      | 179.9     | 206.7    |

conc8 head log during the =8 run showed Running: 8 reqs, Waiting: 0 reqs — vLLM was actually running all 8 in parallel, no queue. 4× DGX Spark with TP=4 + MTP + RDMA has comfortable concurrency headroom on this model.
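To put numbers on the lift, here is a small Python sketch that divides each tuned result by the baseline and also computes the mean per-stream rate at conc8. The table values are from this post; the helper names are hypothetical.

```python
# max-num-seqs setting -> aggregate tok/s from the tuning table
TUNING = {
    2: {"conc4": 37.2, "conc8": 84.3, "peak": 87.5},
    4: {"conc4": 62.3, "conc8": 109.9, "peak": 138.8},
    8: {"conc4": 70.8, "conc8": 179.9, "peak": 206.7},
}


def lift(setting, column, baseline=2):
    """Ratio of a tuned result to the baseline result in the same column."""
    return round(TUNING[setting][column] / TUNING[baseline][column], 2)


def per_stream(aggregate, streams):
    """Mean tok/s seen by each concurrent request."""
    return round(aggregate / streams, 1)
```

With these numbers, lift(8, "conc8") is about 2.13 and per_stream(179.9, 8) is about 22.5, so each of the 8 requests runs at a bit under half the single-stream rate while the cluster as a whole delivers more than 3× it.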

Traps that cost me time

Trap 1: NCCL 2.28.9 with TP=4 wedges every long generation

Headline trap. Symptoms: sanity probe with short generation works clean at ~33 tok/s. First long generation (anything producing >500 tokens of reasoning trace) decays generation throughput to 0.0 after ~30 seconds. Engine logs show shm_broadcast: No available shared memory broadcast block found in 60 seconds markers spaced 60s apart, then TimeoutError: RPC call to sample_tokens timed out, then EngineDeadError. Container restarts. Workers don’t — they zombie with broken-pipe loops to the dead head’s TCPStore. Have to kill+restart all 4 nodes by hand.

I burned three full restart cycles before catching this. The symptom looks like a JIT compile hang — and there ARE Triton compiles happening during inference (warmup doesn’t cover all shapes). But that’s the visible symptom, not the cause. 2.30.4’s collective primitives synchronize TP workers across the compile-latency divergence; 2.28.9 doesn’t. With 2.30.4 + LD_PRELOAD in place, the same JIT compiles still happen during inference, but they complete in ~1 second per kernel, workers stay in sync, the request finishes cleanly.

Hazyumps’ TUNING.md literally says “The NCCL 2.30.4 upgrade eliminates the original hard wedges.” They were right. I just didn’t believe it was load-bearing until I’d ruled out everything else.
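If you want to catch the wedge without staring at logs, a rough Python sketch like the following can classify how far along the failure sequence a head log is. The three marker strings are the ones quoted above; the function name and the stage numbering are my own illustration.

```python
# Failure sequence observed in this post, in the order it appears in the head log.
WEDGE_MARKERS = [
    "No available shared memory broadcast block found in 60 seconds",
    "RPC call to sample_tokens timed out",
    "EngineDeadError",
]


def wedge_stage(log_text):
    """Return 0 if no marker is present, else the 1-based index of the latest marker seen."""
    stage = 0
    for index, marker in enumerate(WEDGE_MARKERS, start=1):
        if marker in log_text:
            stage = index
    return stage
```

Any non-zero stage on NCCL 2.28.9 is a signal to stop waiting and restart all four nodes, since the workers do not recover on their own.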

Trap 2: The container ENTRYPOINT doubles your command

The jasl/vllm image’s ENTRYPOINT is [vllm serve]. If you copy hazyumps’ start scripts verbatim, your "$IMAGE" vllm serve "$MODEL" becomes vllm serve vllm serve $MODEL inside the container (the second vllm is parsed as the model argument) and you get vllm: error: unrecognized arguments: serve /models/DeepSeek-V4-Flash.

Fix: drop the leading vllm serve from the docker command. Just pass "$IMAGE" "$MODEL" $FLAGS. About ten minutes from first launch to noticing. Easy to catch if you grep your logs for the error.
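The mechanism is plain Docker behaviour: whatever follows the image name is appended to an exec-form ENTRYPOINT. A tiny Python illustration, where effective_argv is a hypothetical helper and the ENTRYPOINT value is as described in this post:

```python
import shlex


def effective_argv(entrypoint, docker_args):
    """Command the container actually runs: ENTRYPOINT followed by the docker run arguments."""
    return list(entrypoint) + list(docker_args)


ENTRYPOINT = ["vllm", "serve"]  # as described in this post
MODEL = "/models/DeepSeek-V4-Flash"

# Copying the 2-node script verbatim repeats "vllm serve".
broken = shlex.join(effective_argv(ENTRYPOINT, ["vllm", "serve", MODEL]))
# Passing only the model (plus flags) yields the intended command.
fixed = shlex.join(effective_argv(ENTRYPOINT, [MODEL]))
```

Here broken is "vllm serve vllm serve /models/DeepSeek-V4-Flash", which matches the unrecognized arguments error, and fixed is "vllm serve /models/DeepSeek-V4-Flash".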

Trap 3: Cache mounts are required, not optional

Hazyumps mentions this in passing in TUNING.md: “triton-cache + vllm-cache volume mounts persist compiled kernels across restarts (else a recompile ‘hang’ every boot).”

I read “every boot” and thought “well I’m not rebooting much, I’ll add them later.” Wrong. Even without rebooting, the JIT compile on the first inference call after a container restart causes the exact same wedge as Trap 1 if your NCCL isn’t 2.30.4. And once you ARE on 2.30.4, the cache mounts are how you make sure subsequent restarts don’t repeat the cold-cache hit. Make the host directories first, then mount them at the standard paths inside the container.
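A minimal Python sketch of the "make the host directories first" step. The host-side names vllm-cache and triton-cache follow the wording in TUNING.md but the layout under a single root is my assumption; the in-container paths are the ones used in this post. The demo at the bottom runs against a throwaway directory so it is safe to execute anywhere.

```python
import os
import tempfile

# host subdirectory -> path inside the container
CACHE_TARGETS = {
    "vllm-cache": "/root/.cache/vllm",
    "triton-cache": "/root/.triton/cache",
}


def prepare_cache_mounts(host_root):
    """Create the host cache directories and return the matching docker -v arguments."""
    args = []
    for name, container_path in CACHE_TARGETS.items():
        host_path = os.path.join(host_root, name)
        os.makedirs(host_path, exist_ok=True)  # safe to call on every launch
        args += ["-v", f"{host_path}:{container_path}"]
    return args


# Demo: in real use host_root would be a persistent directory on each Spark.
with tempfile.TemporaryDirectory() as root:
    demo_args = prepare_cache_mounts(root)
    demo_created = sorted(os.listdir(root))
```

Creating the directories yourself also means they are owned by your user; if Docker has to create a missing bind-mount source it does so as root.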

Bonus: --runtime nvidia if your docker doesn’t have it installed

Hazyumps’ scripts use --runtime nvidia --gpus all. If your Spark’s docker doesn’t have the nvidia runtime registered, you get Error response from daemon: unknown or invalid runtime name: nvidia in 90-point bold. --gpus all alone works on my Sparks. Easy fix, caught immediately.

What’s still on the table

  • max-num-batched-tokens past 4096. vLLM logged a warning about this: with speculative decoding enabled, 4096 under-saturates the scheduler.
  • max-model-len past 393216. Model’s max is 1M; I kept it at 384K to match hazyumps. The cluster has the unified-memory headroom for more.
  • MTP acceptance rate tuning. Hot cache gives 0.69–0.77; hazyumps reports ~0.80 on 2-Spark. Possibly a draft-model warmup or compilation-shape thing.

Credit

This recipe is almost entirely hazyumps’ work, scaled up from two nodes to four. Specifically:

  • hazyumps/deepseek-v4-flash-gb10 — 2× Spark recipe, BUILD.md, TUNING.md, NETWORK.md, and the sm12x_deep_gemm_fallbacks.py patch. Everything load-bearing came from this. The “wedge vs. recoverable stall” calibration in TUNING.md saved me at least two rounds of unnecessary debugging.
  • jasl/vllm fork (PR #41834 / codex/ds4-sm120-min-enable): the sm_12x DeepSeek-V4 substrate. Without this, upstream vLLM crashes on load on GB10.
  • @TheAhmadOsman’s tweet in early June saying V4-Flash was running on his 4× Spark — first public claim that 4-node was possible. No recipe came with it, but it’s the reason I tried in the first place.

If you’re in this space and haven’t read hazyumps’ TUNING.md, do that before this post.


Happy to answer questions.

