I didn’t create this recipe, you all did, but I finally tracked it down and got DeepSeek V4 Flash working with 200K context on 2 nodes.
Sharing this since I couldn’t find a confirmed end-to-end recipe for the official DeepSeek-V4-Flash on a 2-node Spark setup, and there was a lot of “nobody has it on 2 nodes yet” floating around. It works. Here’s exactly what I ran.
Setup:
- 2x DGX Spark (GB10), 128GB unified each
- Direct QSFP56 200G cable between them (RoCE/NCCL over the CX-7), link-local addressing
- No Ray. TP=2 with --distributed-executor-backend mp, --nnodes 2
This is built on @eugr @eugr_nv eugr/spark-vllm-docker PR #219 (DeepSeek V4 Flash recipe) + the @jasl jasl/vllm fork. Full credit to them — I just got it stood up and verified on real hardware. Note PR #219 is still open/unmerged.
Build (the one thing to get right: pin the vLLM commit, don’t use a branch alias — only the pinned commit has the GB10 validation behind it):
./build-and-copy.sh \
--vllm-repo https://github.com/jasl/vllm \
--vllm-ref dda4668b59567416f86956cfe7bbc1eab371a61e \
--rebuild-vllm -t vllm-node-dsv4 -c
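If you script the build, a small guard can catch a branch alias sneaking in where the pinned commit should be. This is only a sketch: `is_full_sha` is a helper name I made up, not part of build-and-copy.sh, and it only checks that the ref looks like a full 40-character lowercase hex SHA, not that the commit exists in the repo.

```shell
# Hypothetical helper (not part of spark-vllm-docker): succeeds only if the
# argument looks like a full 40-character lowercase hex commit SHA.
is_full_sha() {
  # A full SHA-1 commit id is exactly 40 characters long.
  [ "${#1}" -eq 40 ] || return 1
  # Reject anything containing a character outside 0-9a-f.
  case "$1" in
    *[!0-9a-f]*) return 1 ;;
  esac
  return 0
}

VLLM_REF="dda4668b59567416f86956cfe7bbc1eab371a61e"
if is_full_sha "$VLLM_REF"; then
  echo "ref is pinned: $VLLM_REF"
else
  echo "ref '$VLLM_REF' is not a full commit SHA, refusing to build" >&2
fi
```

A branch name like `main` or a short SHA fails the check, so the build stops before you spend the time compiling the wrong tree.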
Launch (from the head node):
DOTENV_CONTAINER_NAME=vllm_ds4 nohup ./run-recipe.sh \
deepseek-v4-flash --no-ray --tp 2 --name vllm_ds4 > ds4.log 2>&1 &
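Since the launch runs under nohup and cold start takes minutes, it can help to poll until the server answers instead of tailing the log. A minimal sketch, assuming the recipe exposes vLLM's OpenAI-compatible server on localhost:8000 (vLLM's default port; the recipe may map a different one) and using its `/health` endpoint. `wait_for_vllm` is a name I made up for illustration.

```shell
# Hypothetical helper: poll the vLLM /health endpoint until it responds.
# Usage: wait_for_vllm [base-url] [timeout-seconds] [poll-interval-seconds]
# The default URL and port are assumptions; adjust to whatever the recipe maps.
wait_for_vllm() {
  url="${1:-http://localhost:8000}"
  timeout="${2:-600}"
  interval="${3:-10}"
  waited=0
  # curl -f makes HTTP errors return non-zero; -s keeps it quiet.
  until curl -sf -o /dev/null "$url/health"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "server not ready after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "server ready after ~${waited}s"
}

# Example: wait up to 10 minutes, checking every 10 seconds.
# wait_for_vllm http://localhost:8000 600 10
```

The reported wait time is only as precise as the poll interval, and it counts from when you start polling, not from container start.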
Key flags the recipe sets: official deepseek-ai/DeepSeek-V4-Flash (native FP8, E4M3 128x128 block, ~149GB/46 shards), --kv-cache-dtype fp8, --enable-expert-parallel, speculative deepseek_mtp num_speculative_tokens=2, --max-model-len 200000, --max-num-seqs 2, block-size 256, cudagraph FULL_AND_PIECEWISE.
Numbers I’m seeing (warm, single stream): ~44 tok/s decode. Concurrency=2 aggregate ~45 tok/s. TTFT on short prompts ~2s warm. Cold start container-to-serving was ~6 min. These line up with the jasl GB10 validation baseline (conversational c=1 ~35 t/s, scaling to ~96 t/s aggregate at c=8, MTP spec-accept ~68% on conversational).
Gotchas that cost me time:
- The “Pin NCCL” commit in PR #219 matters — it symlinks the system libnccl; without the current PR head the cross-node init isn’t right.
- build-and-copy’s image copy mangled the worker user for me (double user@). Worked around it with a plain docker save | ssh worker docker load over the link.
- max_num_seqs=2 is intentional at 200K ctx (KV budget). If you want more concurrency, drop max-model-len (the validated profiles do 65K@16seqs, 32K@36seqs).
- Long-context cold prefill is the weak spot: ~53s TTFT at 32K, ~250s at 128K. Fine for normal prompts, rough for huge contexts.
- One of my CX-7 links wedged during teardown churn (mlx5 ACCESS_REG timeout); a clean cold boot cleared it, nothing else did.
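For the image-copy workaround mentioned above, this is roughly what the plain docker save | ssh docker load looks like as a reusable function. A sketch only: `copy_image` is a name I made up, `user@worker` is a placeholder for your worker's SSH target, and `vllm-node-dsv4` is the tag from the build command.

```shell
# Hypothetical helper: stream a local image to another host over SSH,
# with no intermediate tarball on disk.
# Usage: copy_image <image-tag> <ssh-target>
copy_image() {
  image="$1"
  target="$2"
  # docker save writes the image tar to stdout; docker load reads it from stdin.
  # In a plain POSIX pipeline the exit status is that of the remote docker load.
  docker save "$image" | ssh "$target" docker load
}

# Example (placeholder target; pass the user@ prefix exactly once):
# copy_image vllm-node-dsv4 user@worker
```

Pointing the SSH target at the worker's address on the direct link keeps the transfer off the slower management network.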
Hope it saves someone the night I just spent. Curious if anyone’s pushed concurrency or long-ctx prefill further on GB10.
