First - credits where credits due to post by @tonyd615 and @11_p who pointed me to a right direction: DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers - #174 by 11_p as more so to original creator of recipe @aidendle94 and reddit post Reddit - Please wait for verification
By doing this post I am eating my recent words that we haven’t seen 1M operational sessions on Spark besides Nemotrons, now we are. Proven, tested, operational. Massive breakthrough.
================== Full write-up =================
DeepSeek V4 Flash at 1M Context on Dual DGX Spark/Atom AI Top — Working Recipe
After a week of trial and error, I finally have a stable DeepSeek V4 Flash deployment running at 1M context across two DGX Spark (Gigabyte Atom AI Top) nodes with b12x MoE kernels. Sharing the recipe and benchmarks so others don’t hit the same dead ends I did.
TL;DR
- Image:
aidendle94/sparkrun-vllm-ds4-gb10:production-ready— pre-built with b12x MoE, CUDA 12.1, vLLM 0.21.1 - Stack: Docker Compose, TP=2, no Ray, PyTorch distributed backend
- Result: 30-45 t/s decode, 1M context, zero fabric delays, 89/100 tool-calling
Hardware
- 2x DGX Spark / Gigabyte Atom AI Top (GB10, SM121, 128GB unified memory each)
- ConnectX-7 200Gbps RoCE direct cable (QSFP56)
- Head and worker on same subnet (192.168.0.0/24)
Dead Ends Avoided
- lmxxf/vllm-deepseek-v4-dgx-spark: FP4 Marlin backend, not compatible with FP8 weights
- Manual PR 40082 build (
--apply-vllm-pr 40082on vLLM main): FlashInfer/cutlass version mismatch →AttributeError: module 'cutlass.cute.nvgpu' has no attribute 'OperandMajorMode' - Standard spark-vllm-docker builds (dsv4-d568-cherry-sched): No b12x MoE support —
VLLM_USE_B12X_MOE=1env var was unrecognized, 2x slower prefill
Docker Compose Configuration
# compose.yaml
services:
vllm:
image: aidendle94/sparkrun-vllm-ds4-gb10:production-ready
network_mode: host
ipc: host
shm_size: "64gb"
ulimits:
memlock: -1
stack: 67108864
gpus: all
devices:
- /dev/infiniband:/dev/infiniband
volumes:
- ${HF_CACHE:-${HOME}/.cache/huggingface}:/cache/huggingface
- /etc/passwd:/etc/passwd:ro
- /etc/group:/etc/group:ro
environment:
HF_HOME: /cache/huggingface
HF_HUB_OFFLINE: "1"
VLLM_CACHE_ROOT: /cache/huggingface/vllm-cache
VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1"
VLLM_USE_B12X_MOE: "1"
VLLM_SPARSE_INDEXER_MAX_LOGITS_MB: "256"
VLLM_NCCL_SO_PATH: /opt/env/lib/python3.12/.../libnccl.so.2
TORCH_CUDA_ARCH_LIST: "12.1a"
FLASHINFER_CUDA_ARCH_LIST: "12.1a"
NCCL_NET: IB
NCCL_IB_DISABLE: "0"
NCCL_IB_HCA: "rocep1s0f0,roceP2p1s0f0" # EDIT: Your NICs
NCCL_SOCKET_IFNAME: "enP7s7,enp1s0f0np0" # EDIT: Your NICs
NCCL_IB_GID_INDEX: "3"
NCCL_CROSS_NIC: "1"
NCCL_CUMEM_ENABLE: "0"
NCCL_IGNORE_CPU_AFFINITY: "1"
NCCL_DEBUG: WARN
NODE_RANK: "${NODE_RANK:?set 0 on head, 1 on worker}"
HEADLESS: "${HEADLESS:-}"
MASTER_ADDR: "${MASTER_ADDR:?head-node IP}"
command:
- bash
- -lc
- >
exec /usr/local/bin/dsv4-vllm-entrypoint serve deepseek-ai/DeepSeek-V4-Flash
--served-model-name deepseek-v4-flash
--host 0.0.0.0 --port 8000 --trust-remote-code
--tensor-parallel-size 2 --pipeline-parallel-size 1
--kv-cache-dtype fp8 --block-size 256
--max-model-len 1000000 --max-num-seqs 6
--max-num-batched-tokens 8192 --gpu-memory-utilization 0.82
--enable-prefix-caching
--speculative-config '{"method":"mtp","num_speculative_tokens":2}'
--tokenizer-mode deepseek_v4
--distributed-executor-backend mp
--tool-call-parser deepseek_v4 --enable-auto-tool-choice
--reasoning-parser deepseek_v4
--reasoning-config '{"reasoning_parser":"deepseek_v4","reasoning_start_str":"","reasoning_end_str":""}'
--default-chat-template-kwargs.thinking=true
--default-chat-template-kwargs.reasoning_effort=high
--enable-flashinfer-autotune
--nnodes 2 --node-rank ${NODE_RANK}
--master-addr ${MASTER_ADDR} --master-port 25000
${HEADLESS:+--headless}
Per-Node .env Files
Head node (rank 0)
NODE_RANK=0
HEADLESS=
MASTER_ADDR=192.168.0.8 # EDIT: Your head node CX7 IP
HF_CACHE=/home/user/.cache/huggingface
Worker node (rank 1)
NODE_RANK=1
HEADLESS=1
MASTER_ADDR=192.168.0.8 # EDIT: Your head node CX7 IP (same as head)
HF_CACHE=/home/user/.cache/huggingface
Launch Order
Worker first, then Head.
# Terminal 1 — Worker node
cd /home/user/ds4f-aiden-docker
docker compose up -d
# Wait ~10s, then Terminal 2 — Head node
cd /home/user/ds4f-aiden-docker
docker compose up -d
NCCL Fix
Critical: Without shm_size: 64gb and ulimits memlock=-1, you’ll hit:
NCCL error: unhandled system error
Call to ibv_reg_mr_iova2 failed with error Cannot allocate memory
This is a locked memory limit issue. The shm_size and memlock settings fix it.
Performance Results
Single Request (pp1024, tg128)
| Context | Prefill t/s | Decode t/s | TTFT |
|---|---|---|---|
| 0 | 1,188 | 45.7 | 1s |
| 240K | 1,710 | 39.4 | 2.4m |
| 384K | 1,510 | 36.4 | 4.3m |
| 512K | 1,374 | 36.1 | 6.2m |
| 720K | 1,187 | 35.0 | 10.1m |
| 980K | 986 | 30.4 | 16.6m |
Key observations:
- MLA KV cache is ~2% of GQA — zero fabric delays even at 980K
- No YaRN tax — decode drops only 23% from 0 to 980K (45→30 t/s)
- For comparison, Qwen 3.5-122B at 256K+: decode collapses from 40→15 t/s
Concurrency (pp2048, tg128)
| Config | Depth | Prefill t/s | Decode t/s | TTFT |
|---|---|---|---|---|
| c1 | d0 | 1,942 | 36.5 | 1.2s |
| c2 | d0 | 1,843 | 54.4 | 2.2s |
| c4 | d0 | 1,883 | 47.8 | 3.7s |
| c1 | d4K | 2,090 | 38.5 | 3.1s |
| c2 | d4K | 2,028 | 38.3 | 5.0s |
Sweet spot: c2 at zero context yields 54.4 t/s decode.
Tool-Calling Quality
tool-eval-bench v1.8.0 — 89/100
| Category | Score |
|---|---|
| Tool Selection | 100% |
| Parameter Precision | 83% |
| Multi-Step Chains | 75% |
| Restraint & Refusal | 100% |
| Error Recovery | 100% |
| Structured Output | 100% |
| Overall | 89/100 |
IFEval
- Instruction-level: 88.2%
- Prompt-level: 84.8%
Acknowledgements
- Original Docker image by aidendle94 — thank you for the pre-built b12x image
- Built on the shoulders of: jasl/vllm fork, lukealonso/b12x kernels, eugr/spark-vllm-docker toolchain
- NVIDIA Developer Forums community for debugging NCCL issues