[GUIDE] DeepSeek-V4-Flash on 2× DGX Spark (GB10) — Reproducible vLLM Serving Recipe up to 1M Token Context

Hey everyone,

I wanted to share a repository I put together after spending several weeks getting DeepSeek-V4-Flash FP8 running on two DGX Spark GB10 units.

The main point of the repo is not to publish a new model or a vLLM fork. It is a reproducible serving recipe for people trying to run DeepSeek-V4-Flash on GB10 / SM121 today, including the build, launch, memory, networking, and stability details that were not obvious when I started.

The core problem:
Stock vLLM does not yet provide a simple, stable “it just works” path for DeepSeek-V4-Flash on GB10 / SM121. The current working route depends on the SM12x enablement work from an upstream vLLM PR. That PR adds the missing SM120/SM121 model and kernel support, plus fallback paths for cases where SM100-only or unreleased dependency paths are not usable on GB10.

What the recipe does:

  • Builds a GB10 / SM121-compatible vLLM image from the relevant upstream PR branch.
  • Provides launch templates for 2x DGX Spark with tensor parallelism over RoCEv2 / ConnectX networking.
  • Includes two profiles:
      • 1M context for maximum context length, with low sequence concurrency.
      • 256K context for better aggregate throughput.
  • Documents GB10-specific UMA behavior. On GB10, model weights, KV cache, CUDA graphs, and the rest of the process share the same unified memory pool, so memory tuning matters much more than on classic separate-VRAM setups.
  • Documents the practical failure modes I hit or had to design around: KV-cache pressure, MTP speculative decoding issues, Marlin / MoE behavior, CUDA graph sensitivity, and long-context stability limits.
  • Includes benchmark numbers and validation gates so others can compare their own setup.

Important clarification:
The repo does not claim to fix an NVIDIA driver or firmware issue. It also does not distribute model weights, CUDA libraries, or binaries.

There is a related Blackwell GSP hard-hang issue tracked publicly in NVIDIA/open-gpu-kernel-modules #1111. That issue is on SM120 hardware, not GB10 / SM121, so I treat it as a related failure class rather than proof of the same root cause on DGX Spark. For that reason the repo includes conservative long-context guidance and recommends soak testing before treating 1M context as production-ready.

Quick numbers from my setup:

  • 1M context: around 37 tok/s single-stream, around 100 tok/s aggregate, max seqs 6.
  • 256K context: around 40 tok/s single-stream, around 150 tok/s aggregate, max seqs 24.
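
One way to read these numbers is as a trade-off between per-request speed and total throughput. The sketch below only does arithmetic on the figures quoted above. It assumes, which the post does not state, that the aggregate figure was measured with all max-seqs slots busy; the `Profile` class and its field names are illustrative and not part of the repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    single_stream_tps: float  # tok/s with one request in flight
    aggregate_tps: float      # tok/s summed over concurrent requests
    max_seqs: int             # concurrency cap for the profile

    def scaling(self) -> float:
        """Aggregate throughput relative to a single stream."""
        return self.aggregate_tps / self.single_stream_tps

    def per_seq_tps_at_cap(self) -> float:
        """Per-request share, ASSUMING the aggregate was measured at max_seqs."""
        return self.aggregate_tps / self.max_seqs


# Approximate figures quoted in the post ("around" values, not exact).
PROFILES = [
    Profile("1M", 37, 100, 6),
    Profile("256K", 40, 150, 24),
]

if __name__ == "__main__":
    for p in PROFILES:
        print(
            f"{p.name}: x{p.scaling():.2f} over single-stream, "
            f"~{p.per_seq_tps_at_cap():.1f} tok/s per request at the cap"
        )
```

Under that assumption, the 256K profile gets more total work done while each individual request runs slower at full concurrency, which is the usual reason to keep both profiles around.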

Repository: GanyX19/deepseek-v4-1m-on-dgx-spark on GitHub. Reproducible recipe: serve DeepSeek-V4-Flash with up to 1M token context on 2x NVIDIA DGX Spark (GB10) via vLLM (TP=2). Build (sm_121), launch templates, hardware bring-up, known issues, benchmarks. No binaries/weights.

I would be interested in feedback from anyone running DeepSeek-V4-Flash on GB10 / SM121, especially if you have additional logs, stability results, or cleaner workarounds for the SM12x vLLM path.

Extremely detailed building and deployment guide! I need this! Thanks for your great effort!

Following up on this guide with a significant finding after more debugging on our 2× DGX Spark (GB10).

TL;DR: the gradual UMA-OOM host hard-freeze on multi-node TP=2 (including what looked like the “GSP firmware hard-lock under sustained 1 M load”) is not primarily caused by firmware or by a too-high --gpu-memory-utilization. It’s a per-request memory leak in UCX, the inter-node RDMA transport vLLM uses under NCCL. Two env vars stop it:

UCX_MEM_MMAP_HOOK_MODE=none
UCX_RCACHE_MAX_UNRELEASED=1024
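
If the server is started from a Python wrapper rather than a shell script, the same two variables can be injected into the child environment. This is a minimal sketch, not the repo's serve script: `env_with_ucx_fix` and `launch` are hypothetical helper names, and the command in the demo is a placeholder that only echoes the variables. The variables need to be present in the environment of the vLLM processes on both nodes; how they get propagated depends on the launcher and is not shown here.

```python
import os
import subprocess
import sys

# The two settings from the post, as strings for the process environment.
UCX_FIX = {
    "UCX_MEM_MMAP_HOOK_MODE": "none",
    "UCX_RCACHE_MAX_UNRELEASED": "1024",
}


def env_with_ucx_fix(base=None):
    """Return a copy of `base` (default: the current environment) with the UCX vars set."""
    env = dict(os.environ if base is None else base)
    env.update(UCX_FIX)
    return env


def launch(cmd):
    """Run `cmd` with the UCX vars applied and return its exit code."""
    return subprocess.run(cmd, env=env_with_ucx_fix(), check=False).returncode


if __name__ == "__main__":
    # Placeholder command: a child Python process that prints what it inherited.
    # Replace it with the real serve command for the node.
    probe = (
        "import os; "
        "print(os.environ['UCX_MEM_MMAP_HOOK_MODE'], "
        "os.environ['UCX_RCACHE_MAX_UNRELEASED'])"
    )
    sys.exit(launch([sys.executable, "-c", probe]))
```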

Mechanism: UCX hooks every mmap for its RDMA memory-registration cache. With the default UCX_RCACHE_MAX_UNRELEASED=inf the unreleased-region queue grows unbounded (~14 MB per request in our measurements), and on GB10 (GPU + host RAM are one unified pool) that drains the UMA until the host silently wedges. Same root cause Mistral documented for vLLM; it just lands harder on unified memory.
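
To get a feel for the scale, here is a back-of-the-envelope calculation using the ~14 MB per request figure. It assumes the leak is linear in the number of requests and that 1 GB is 1024 MB; both are simplifications, and the function name is made up for illustration.

```python
# Rough estimate: how many requests a fixed per-request leak needs
# to consume a given amount of free unified memory.
LEAK_PER_REQUEST_MB = 14  # ~14 MB per request, as measured in the post


def requests_until_exhausted(free_gb, leak_mb=LEAK_PER_REQUEST_MB):
    """Whole requests served before `free_gb` of headroom is gone (1 GB = 1024 MB)."""
    return int(free_gb * 1024 // leak_mb)


if __name__ == "__main__":
    for free_gb in (4, 8, 10):
        n = requests_until_exhausted(free_gb)
        print(f"{free_gb} GB of headroom lasts about {n} requests")
```

A few hundred requests is enough to eat several GB, so on a node with single-digit GB of free UMA the leak becomes visible within minutes to hours of steady traffic rather than days.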

Evidence (A/B, 256 K, 5 concurrent, varied shapes):

  • Without the vars: free UMA fell 10 → 6 GB, OOM-aborted in ~18 min.
  • With the vars: flat, no growth, no abort, same load.

Why it looked like a “1 M-only firmware lock”: 1 M context has the tightest UMA headroom, so the same leak fills it fastest and freezes soonest. With the fix, 256 K, 512 K and 1 M all boot and serve cleanly here (1 M: KV pool 1.43 M tokens @ util 0.80, ~8 GB free).
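
For anyone repeating the A/B test, a simple trend check over free-memory samples makes "leaking" versus "flat" less of a judgment call. The sketch below fits a least-squares slope to (minutes, free GB) samples and extrapolates to zero. Only the two endpoints of the failing run (10 GB at the start, 6 GB at ~18 min) come from the post; the function names are illustrative, and linear extrapolation is an assumption.

```python
def slope_gb_per_min(samples):
    """Least-squares slope of free memory over time; samples are (minutes, free_gb)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_f = sum(f for _, f in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (f - mean_f) for t, f in samples)
    return cov / var_t


def minutes_until_empty(samples):
    """Linear extrapolation from the last sample to zero free memory.

    Returns None when free memory is flat or growing.
    """
    slope = slope_gb_per_min(samples)
    if slope >= 0:
        return None
    _, last_free = samples[-1]
    return last_free / -slope


if __name__ == "__main__":
    without_vars = [(0, 10.0), (18, 6.0)]  # endpoints reported in the post
    with_vars = [(0, 10.0), (18, 10.0)]    # "flat", idealized
    print(slope_gb_per_min(without_vars), minutes_until_empty(without_vars))
    print(slope_gb_per_min(with_vars), minutes_until_empty(with_vars))
```

In practice the process aborts before free memory reaches zero, so the extrapolated time is an upper bound on how long a leaking run survives.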

Bonus hardening (recommended regardless): make a UMA-OOM recoverable instead of a host wedge:

    sysctl -w vm.min_free_kbytes=3145728 # ~3 GB reserve
    swapoff -a

That way the kernel OOM-kills the process cleanly (your watchdog relaunches it) instead of starving itself into a freeze. NVIDIA has acknowledged the UMA-OOM-host-wedge as a known GB10 issue; this is a solid interim measure.
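
A watchdog can check how close a node is to that reserve by reading `MemAvailable` from `/proc/meminfo`. This is a minimal sketch with made-up helper names; it only parses and compares, and leaves the relaunch logic to whatever watchdog is already in place. The 3145728 kB value is the one from the sysctl line above, which works out to exactly 3 GiB.

```python
import os

MIN_FREE_KBYTES = 3145728  # value passed to vm.min_free_kbytes above
KB_PER_GIB = 1024 * 1024


def parse_meminfo_kb(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def available_gib(meminfo_text):
    """MemAvailable converted from kB to GiB."""
    return parse_meminfo_kb(meminfo_text)["MemAvailable"] / KB_PER_GIB


def margin_over_reserve_gib(meminfo_text, reserve_kb=MIN_FREE_KBYTES):
    """How far MemAvailable sits above the kernel reserve, in GiB."""
    return available_gib(meminfo_text) - reserve_kb / KB_PER_GIB


if __name__ == "__main__":
    path = "/proc/meminfo"
    if os.path.exists(path):  # Linux only
        with open(path) as fh:
            text = fh.read()
        print(f"available: {available_gib(text):.2f} GiB, "
              f"margin over reserve: {margin_over_reserve_gib(text):.2f} GiB")
```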

On --gpu-memory-utilization: lowering it only delayed the freeze (more room for the leak to eat); it was never the cause. With UCX fixed, util is bounded only by per-node UMA working-set headroom; ~0.80–0.82 is the practical max on GB10 (0.85 leaves only ~1 GB/node).
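
The util ceiling can be sanity-checked with simple arithmetic. The sketch below assumes 128 GB of unified memory per node, that the vLLM budget scales linearly with the util fraction, and that everything outside that budget (OS, other processes) is a fixed amount calibrated from one observation. These are modeling assumptions, not measurements, and the function names are illustrative.

```python
TOTAL_GB = 128.0  # ASSUMPTION: unified memory per node as seen by the process


def outside_budget_gb(util, total_gb=TOTAL_GB):
    """Memory left outside the vLLM budget at a given util fraction."""
    return total_gb * (1.0 - util)


def calibrate_other_gb(util, observed_free_gb, total_gb=TOTAL_GB):
    """Infer fixed non-vLLM usage from one (util, free memory) observation."""
    return outside_budget_gb(util, total_gb) - observed_free_gb


def predicted_free_gb(util, other_gb, total_gb=TOTAL_GB):
    """Free memory expected at `util`, given the calibrated fixed usage."""
    return outside_budget_gb(util, total_gb) - other_gb


if __name__ == "__main__":
    # Calibrate on the reported data point: ~8 GB free at util 0.80.
    other = calibrate_other_gb(0.80, 8.0)
    for util in (0.80, 0.82, 0.85):
        print(f"util {util:.2f}: ~{predicted_free_gb(util, other):.1f} GB free")
```

With these assumptions the model predicts about 1.6 GB free at 0.85, in the same range as the ~1 GB/node reported above, provided both figures refer to comparable conditions. Each 0.01 step in util costs roughly 1.3 GB per node, which is why the usable range ends so abruptly.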

I’ve updated the repo with all of this: UCX vars in every serve script, a new 512 K middle-ground recipe, and rewritten known-issues:

👉 GanyX19/deepseek-v4-1m-on-dgx-spark on GitHub

Honest caveat: I’ve A/B-load-tested 256 K and verified 1 M + 512 K boot/serve with the fix, but a >12 h sustained-saturation soak with the fix is still pending; if anyone runs one, please report back.

Hope this saves someone the week it cost me.