[GUIDE] DeepSeek-V4-Flash on 2× DGX Spark (GB10) — Reproducible vLLM Serving Recipe up to 1M Token Context

alex.busse · June 27, 2026, 10:04am

Hey everyone,

I wanted to share a repository I put together after spending several weeks getting DeepSeek-V4-Flash FP8 running on two DGX Spark GB10 units.

The main point of the repo is not to publish a new model or a vLLM fork. It is a reproducible serving recipe for people trying to run DeepSeek-V4-Flash on GB10 / SM121 today, including the build, launch, memory, networking, and stability details that were not obvious when I started.

The core problem:
Stock vLLM does not yet provide a simple, stable “it just works” path for DeepSeek-V4-Flash on GB10 / SM121. The current working route depends on the SM12x enablement work from an upstream vLLM PR. That PR adds the missing SM120/SM121 model and kernel support, plus fallback paths for cases where SM100-only or unreleased dependency paths are not usable on GB10.

What the recipe does:

- Builds a GB10 / SM121-compatible vLLM image from the relevant upstream PR branch.
- Provides launch templates for 2× DGX Spark with tensor parallelism over RoCEv2 / ConnectX networking.
- Includes two profiles:
  - 1M context for maximum context length, with low sequence concurrency.
  - 256K context for better aggregate throughput.
- Documents GB10-specific UMA behavior. On GB10, model weights, KV cache, CUDA graphs, and the rest of the process share the same unified memory pool, so memory tuning matters much more than on classic separate-VRAM setups.
- Documents the practical failure modes I hit or had to design around: KV-cache pressure, MTP speculative decoding issues, Marlin / MoE behavior, CUDA graph sensitivity, and long-context stability limits.
- Includes benchmark numbers and validation gates so others can compare their own setup.
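
One way to picture the two profiles is as two sets of vLLM launch arguments. The sketch below is an illustration only: it assumes "1M" and "256K" mean 1,048,576 and 262,144 tokens, takes the max-seqs values from the quick numbers in this post, and uses a placeholder model ID. The repo's actual launch templates may set more or different flags, and the multi-node wiring between the two Sparks is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    max_model_len: int  # context length in tokens (assumed power-of-two sizes)
    max_num_seqs: int   # concurrent sequences, from the numbers in this post


PROFILES = {
    "1m": Profile("1m", 1_048_576, 6),
    "256k": Profile("256k", 262_144, 24),
}


def serve_args(profile: Profile, model: str = "MODEL_ID_PLACEHOLDER") -> list[str]:
    """Build a partial `vllm serve` argument list for one profile."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", "2",
        "--max-model-len", str(profile.max_model_len),
        "--max-num-seqs", str(profile.max_num_seqs),
    ]


print(" ".join(serve_args(PROFILES["256k"])))
```

Keeping the profiles as data makes it easy to add a middle-ground profile later without touching the launch logic.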

Important clarification:
The repo does not claim to fix an NVIDIA driver or firmware issue. It also does not distribute model weights, CUDA libraries, or binaries.

There is a related Blackwell GSP hard-hang issue tracked publicly in NVIDIA/open-gpu-kernel-modules #1111. That issue is on SM120 hardware, not GB10 / SM121, so I treat it as a related failure class rather than proof of the same root cause on DGX Spark. For that reason the repo includes conservative long-context guidance and recommends soak testing before treating 1M context as production-ready.

Quick numbers from my setup:

- 1M context: around 37 tok/s single-stream, around 100 tok/s aggregate, max seqs 6.
- 256K context: around 40 tok/s single-stream, around 150 tok/s aggregate, max seqs 24.
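
As a rough reading of these numbers, dividing aggregate throughput by the number of sequences gives an average per-sequence rate under full load. This assumes the aggregate figures were measured at the max-seqs limit, which the post does not state, so treat the result as an illustration of the trade-off rather than a measurement.

```python
def per_sequence_rate(aggregate_tok_s: float, num_seqs: int) -> float:
    """Average tok/s each sequence sees if the aggregate is shared evenly."""
    return aggregate_tok_s / num_seqs


# (aggregate tok/s, max seqs) as quoted above; concurrency at measurement is assumed.
profiles = {"1M": (100, 6), "256K": (150, 24)}

for name, (aggregate, seqs) in profiles.items():
    print(f"{name}: ~{per_sequence_rate(aggregate, seqs):.1f} tok/s per sequence")
```

Under that assumption, the 256K profile trades per-sequence speed for higher total throughput.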

Repository: GanyX19/deepseek-v4-1m-on-dgx-spark (GitHub). Reproducible recipe: serve DeepSeek-V4-Flash with up to 1M token context on 2x NVIDIA DGX Spark (GB10) via vLLM (TP=2). Build (sm_121), launch templates, hardware bring-up, known issues, benchmarks. No binaries/weights.

I would be interested in feedback from anyone running DeepSeek-V4-Flash on GB10 / SM121, especially if you have additional logs, stability results, or cleaner workarounds for the SM12x vLLM path.

donoughliu · June 27, 2026, 11:07am

Extremely detailed build and deployment guide! I need this! Thanks for your great effort!

alex.busse · June 30, 2026, 3:17am

Following up on this guide with a significant finding after more debugging on our 2× DGX Spark (GB10).

TL;DR: the gradual UMA-OOM host hard-freeze on multi-node TP=2 (including what looked like the “GSP firmware hard-lock under sustained 1M load”) is not primarily caused by firmware or by a too-high --gpu-memory-utilization. It is a per-request memory leak in UCX, the inter-node RDMA transport vLLM uses under NCCL. Two environment variables stop it:

UCX_MEM_MMAP_HOOK_MODE=none
UCX_RCACHE_MAX_UNRELEASED=1024
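
One way to make sure both variables reach the server process is to set them in the environment that the launcher passes down. The helper below is a minimal sketch: `with_ucx_fix` is a hypothetical name, and the launch command itself is left to your own serve script.

```python
import os

UCX_FIX = {
    "UCX_MEM_MMAP_HOOK_MODE": "none",
    "UCX_RCACHE_MAX_UNRELEASED": "1024",
}


def with_ucx_fix(base_env=None):
    """Return a copy of the environment with the two UCX variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(UCX_FIX)  # overrides any earlier value, e.g. "inf"
    return env


env = with_ucx_fix()
# Pass `env` to the process that starts the server, for example:
#   subprocess.run(launch_command, env=env, check=True)
# where `launch_command` is your own serve script (not shown here).
print({key: env[key] for key in UCX_FIX})
```

Since the leak is in the inter-node transport, the variables should be present on both nodes.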

Mechanism: UCX hooks every mmap call to maintain its RDMA memory-registration cache. With the default UCX_RCACHE_MAX_UNRELEASED=inf, the queue of unreleased regions grows without bound (~14 MB per request in our measurements). On GB10, where GPU and host RAM are one unified pool, that drains the UMA until the host silently wedges. This is the same root cause Mistral documented for vLLM; it just hits harder on unified memory.
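
To get a feel for how quickly a leak of this size eats unified memory, the sketch below divides free headroom by the ~14 MB per request quoted above. It assumes the leak is constant per request and that nothing else claims the headroom, both of which are simplifications.

```python
LEAK_MB_PER_REQUEST = 14  # approximate figure from the measurements above


def requests_until_exhausted(free_gb: float, leak_mb: float = LEAK_MB_PER_REQUEST) -> int:
    """Rough number of requests before `free_gb` of headroom is consumed."""
    return int(free_gb * 1024 / leak_mb)


for free_gb in (4, 8, 10):  # example headroom values in GB
    print(f"{free_gb} GB free -> ~{requests_until_exhausted(free_gb)} requests")
```

On this estimate, a few hundred requests are enough to consume several GB, which fits a freeze that only shows up under sustained load.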

Evidence (A/B test at 256K context, 5 concurrent requests, varied request shapes):

- Without the vars: free UMA fell from 10 GB to 6 GB and the run OOM-aborted in ~18 min.
- With the vars: free UMA stayed flat under the same load, with no growth and no abort.
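
To repeat this kind of A/B comparison, it helps to log available memory while the load runs. The sketch below parses the `MemAvailable` field of `/proc/meminfo`. It assumes that on GB10 the host-side `MemAvailable` value is a usable proxy for free UMA, and the sample text is made up for illustration.

```python
import re


def mem_available_gb(meminfo_text: str) -> float:
    """Extract MemAvailable from /proc/meminfo text and convert kB to GB (GiB)."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable not found")
    return int(match.group(1)) / (1024 * 1024)


def drop_gb_per_min(samples, interval_s: float) -> float:
    """Average drop in available memory per minute across evenly spaced samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed_min = (len(samples) - 1) * interval_s / 60
    return (samples[0] - samples[-1]) / elapsed_min


# Illustrative sample, not a real capture from a DGX Spark.
sample = "MemTotal:       125000000 kB\nMemAvailable:   10485760 kB\n"
print(f"{mem_available_gb(sample):.1f} GB available")

# On a live system: mem_available_gb(open("/proc/meminfo").read())
```

A drop rate that stays near zero under load is the signal to look for with the variables set.
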
Why it looked like a “1M-only firmware lock”: 1M context has the tightest UMA headroom, so the same leak exhausts it fastest and the host freezes soonest. With the fix, 256K, 512K, and 1M all boot and serve cleanly here (1M: KV pool of 1.43M tokens at util 0.80, ~8 GB free).

Bonus hardening (recommended regardless): make a UMA OOM recoverable instead of a host wedge:

sysctl -w vm.min_free_kbytes=3145728 # ~3 GB reserve
swapoff -a

With these set, the kernel OOM-kills the process cleanly (and your watchdog relaunches it) instead of starving itself into a freeze. NVIDIA has acknowledged the UMA-OOM host wedge as a known GB10 issue; this is a solid interim measure.
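
A quick way to confirm that both settings took effect is to read them back from `/proc`. The sketch below checks text in the format of `/proc/sys/vm/min_free_kbytes` and `/proc/swaps`; the function names are hypothetical and the sample strings stand in for the real files. Note that `sysctl -w` and `swapoff -a` do not persist across reboots on their own.

```python
RESERVE_KB = 3 * 1024 * 1024  # 3 GiB in kB, i.e. the 3145728 used above


def reserve_ok(min_free_kbytes_text: str, wanted_kb: int = RESERVE_KB) -> bool:
    """True if vm.min_free_kbytes is at least the wanted reserve."""
    return int(min_free_kbytes_text.strip()) >= wanted_kb


def swap_off(proc_swaps_text: str) -> bool:
    """True if /proc/swaps lists no active swap devices (header line only)."""
    lines = [line for line in proc_swaps_text.splitlines() if line.strip()]
    return len(lines) <= 1


# Sample contents; on a live host read the two /proc files instead.
print(reserve_ok("3145728\n"), swap_off("Filename\tType\tSize\tUsed\tPriority\n"))
```
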
On --gpu-memory-utilization: lowering it only delayed the freeze (more room for the leak to eat); it never addressed the cause. With UCX fixed, util is bounded only by per-node UMA working-set headroom; ~0.80–0.82 is the practical maximum on GB10 (0.85 leaves only ~1 GB/node).
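
These headroom figures are roughly consistent with simple arithmetic. The sketch below assumes about 128 GB of unified memory per node (the amount the driver actually reports may be lower) and that the utilization fraction scales the vLLM budget linearly, so each 0.01 of `--gpu-memory-utilization` costs about 1.28 GB of headroom. It starts from the ~8 GB free observed at 0.80.

```python
NODE_MEMORY_GB = 128  # assumed per-node unified memory; adjust to your system


def free_after_change(free_gb_now: float, util_now: float, util_new: float,
                      total_gb: float = NODE_MEMORY_GB) -> float:
    """Estimate free UMA after changing the utilization fraction."""
    return free_gb_now - (util_new - util_now) * total_gb


for util in (0.80, 0.82, 0.85):
    estimate = free_after_change(8.0, 0.80, util)
    print(f"util {util:.2f}: ~{estimate:.1f} GB free per node")
```

The estimate for 0.85 lands between 1 and 2 GB, in the same range as the ~1 GB/node reported above.
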
I’ve updated the repo with all of this: UCX vars in every serve script, a new 512K middle-ground recipe, and rewritten known-issues documentation:
👉 GanyX19/deepseek-v4-1m-on-dgx-spark (GitHub)

Honest caveat: I’ve A/B load-tested 256K and verified that 1M and 512K boot and serve with the fix, but a >12 h sustained-saturation soak with the fix is still pending. If anyone runs one, please report back.

Hope this saves someone the week it cost me.
