Hey everyone,
I wanted to share a repository I put together after spending several weeks getting DeepSeek-V4-Flash FP8 running on two DGX Spark GB10 units.
The main point of the repo is not to publish a new model or a vLLM fork. It is a reproducible serving recipe for people trying to run DeepSeek-V4-Flash on GB10 / SM121 today, including the build, launch, memory, networking, and stability details that were not obvious when I started.
The core problem:
Stock vLLM does not yet provide a stable out-of-the-box path for DeepSeek-V4-Flash on GB10 / SM121. The current working route depends on the SM12x enablement work in an upstream vLLM PR. That PR adds the missing SM120/SM121 model and kernel support, plus fallback paths for cases where SM100-only code paths or not-yet-released dependencies cannot be used on GB10.
What the recipe does:
- Builds a GB10 / SM121-compatible vLLM image from the relevant upstream PR branch.
- Provides launch templates for 2x DGX Spark with tensor parallelism over RoCEv2 / ConnectX networking.
- Includes two profiles:
  - 1M context for maximum context length, with low sequence concurrency.
  - 256K context for better aggregate throughput.
- Documents GB10-specific UMA behavior. On GB10, model weights, KV cache, CUDA graphs, and the rest of the process share the same unified memory pool, so memory tuning matters much more than on classic separate-VRAM setups.
- Documents the practical failure modes I hit or had to design around: KV-cache pressure, MTP speculative decoding issues, Marlin / MoE behavior, CUDA graph sensitivity, and long-context stability limits.
- Includes benchmark numbers and validation gates so others can compare their own setup.
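To make the UMA point above a bit more concrete, here is a minimal back-of-the-envelope sketch of how the shared pool gets divided on one node. It is not code from the repo. The 128 GB total is the nominal unified memory of a DGX Spark; the utilization fraction, weight size, and overhead figures are placeholder assumptions for illustration, not measurements of DeepSeek-V4-Flash.

```python
def kv_cache_budget_gb(total_gb, utilization, weights_gb, overhead_gb):
    """Rough per-node KV-cache budget in GB under a unified memory pool.

    total_gb    : unified memory on the node (shared by CPU and GPU)
    utilization : fraction of the pool the serving engine may claim
    weights_gb  : this node's shard of the model weights (placeholder)
    overhead_gb : CUDA graphs, activations, runtime buffers (placeholder)
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    engine_budget = total_gb * utilization
    return engine_budget - weights_gb - overhead_gb


def host_headroom_gb(total_gb, utilization):
    """Memory left for the OS and every other process on the same pool."""
    return total_gb * (1.0 - utilization)


# Hypothetical numbers, chosen only to show the arithmetic.
kv_gb = kv_cache_budget_gb(total_gb=128, utilization=0.80,
                           weights_gb=80, overhead_gb=6)
host_gb = host_headroom_gb(total_gb=128, utilization=0.80)
print(f"KV cache budget: {kv_gb:.1f} GB, host headroom: {host_gb:.1f} GB")
```

The takeaway is that raising the utilization fraction does not only grow the KV cache, it also shrinks what is left for the OS and everything else on the box, which is not the case with separate VRAM.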
Important clarification:
The repo does not claim to fix an NVIDIA driver or firmware issue. It also does not distribute model weights, CUDA libraries, or binaries.
There is a related Blackwell GSP hard-hang issue tracked publicly in NVIDIA/open-gpu-kernel-modules #1111. That issue was reported on SM120 hardware, not GB10 / SM121, so I treat it as a related failure class rather than proof of the same root cause on DGX Spark. For that reason, the repo includes conservative long-context guidance and recommends soak testing before treating 1M context as production-ready.
Quick numbers from my setup:
- 1M context: around 37 tok/s single-stream, around 100 tok/s aggregate, max seqs 6.
- 256K context: around 40 tok/s single-stream, around 150 tok/s aggregate, max seqs 24.
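For comparing the two profiles, a small sketch like the following can help. It only uses the rounded numbers quoted above. The `Profile` class and its method names are made up for this illustration, and the per-sequence figure assumes the aggregate throughput was measured with all sequence slots busy, which may not match how you benchmark your own setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    single_stream_tps: float  # tok/s with one active sequence
    aggregate_tps: float      # tok/s summed over concurrent sequences
    max_seqs: int             # configured maximum concurrent sequences

    def concurrency_gain(self) -> float:
        """How much total throughput grows over a single stream."""
        return self.aggregate_tps / self.single_stream_tps

    def per_seq_tps_at_max(self) -> float:
        """Per-sequence tok/s, ASSUMING aggregate was measured at max_seqs."""
        return self.aggregate_tps / self.max_seqs


# Rounded numbers from the post ("around" values, not exact measurements).
PROFILES = [
    Profile("1M", single_stream_tps=37, aggregate_tps=100, max_seqs=6),
    Profile("256K", single_stream_tps=40, aggregate_tps=150, max_seqs=24),
]

for p in PROFILES:
    print(f"{p.name}: gain x{p.concurrency_gain():.2f}, "
          f"~{p.per_seq_tps_at_max():.2f} tok/s per sequence at max seqs")
```

Read this way, the 256K profile trades per-sequence speed under load for higher total throughput, while the 1M profile keeps fewer sequences moving faster each.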
I would be interested in feedback from anyone running DeepSeek-V4-Flash on GB10 / SM121, especially if you have additional logs, stability results, or cleaner workarounds for the SM12x vLLM path.