DeepSeek V4 Flash MXFP4 proof-of-life on a single GB10/GX10 (128 GB unified)
Hi all,
I wanted to share an early technical proof-of-life for running DeepSeek V4 Flash (43L, MoE, MLA) in MXFP4 on a single NVIDIA GB10/GX10-class system with 128 GB unified memory.
This is not a polished release and not a vLLM replacement. It is a custom experimental C++ runtime focused on validating whether the model can run correctly end-to-end on a single-node GB10 setup, with no sharding, no pipeline parallelism, and no tensor parallelism — all 43 layers, 256 routed experts per layer, KV cache, and workspace resident in unified memory.
Hardware / Runtime
- NVIDIA GB10 / GX10-class system, single node
- 128 GB unified memory (CPU+GPU)
- CUDA 13.x, sm_121a
- Custom C++ MXFP4 runtime (no vLLM, no llama.cpp)
- 43-layer DeepSeek V4 Flash forward path
- Routed MoE experts via hot-set cache with mmap-backed cold storage
- Multi-head Latent Attention (MLA)
- YARN / compressed RoPE
- Same-process multi-prompt smoke test
- Weights pulled from the public HuggingFace DSv4 Flash safetensors snapshot
Current correctness status
The runtime now passes the main correctness and lifecycle gates:
Forced-token R2-vs-C++ equivalence harness. At each decode step, the C++ engine is fed the previous token from the Python reference forward pass (R2) rather than its own output, and the predicted argmax and full logits are compared:
- 8/8 forced-token argmax match against R2
- logits cosine: 0.9994–0.9999, max abs error 0.6–1.1
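The three numbers reported per step (argmax match, logits cosine, max abs error) can be computed along these lines. This is a minimal sketch, not the harness itself: `LogitDiff` and `compare_logits` are hypothetical names, and the code assumes both logit vectors are non-empty and share the same vocabulary size.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type; not taken from the actual runtime.
struct LogitDiff {
    bool   argmax_match;  // same top-1 token on both sides
    double cosine;        // cosine similarity of the full logit vectors
    double max_abs_err;   // largest element-wise absolute difference
};

// Index of the largest element (first one wins on ties).
inline std::size_t argmax(const std::vector<float>& v) {
    return static_cast<std::size_t>(std::max_element(v.begin(), v.end()) - v.begin());
}

// Compare one decode step's logits: reference (R2) vs. the engine under test.
// Assumes ref and test are non-empty and have equal length.
inline LogitDiff compare_logits(const std::vector<float>& ref,
                                const std::vector<float>& test) {
    double dot = 0.0, nr = 0.0, nt = 0.0, max_err = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double r = ref[i], t = test[i];
        dot += r * t;
        nr  += r * r;
        nt  += t * t;
        max_err = std::max(max_err, std::fabs(r - t));
    }
    const double denom = std::sqrt(nr) * std::sqrt(nt);
    const double cosine = denom > 0.0 ? dot / denom : 0.0;
    return { argmax(ref) == argmax(test), cosine, max_err };
}
```

Accumulating in double keeps the cosine stable over a large vocabulary even when the logits themselves are stored as float.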
Per-layer R2-vs-C++ cosine bisect. Each intermediate tensor (HC pre/post, RMS, MLA, MoE, post-FFN) is compared layer by layer:
- No first_bad_layer found in 0..42 with all bug fixes applied (HC broadcast, W2/W3 enum, YARN ramp)
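Given one cosine value per layer, locating the first diverging layer reduces to a scan. The sketch below is an illustration under assumptions: the function signature and the 0.999 threshold in the usage note are hypothetical, since the post does not state the pass criterion used by the harness.

```cpp
#include <cstddef>
#include <vector>

// Returns the index of the first layer whose cosine against the reference
// falls below `threshold`, or -1 if every layer passes.
// A NaN cosine is treated as a failure because the >= comparison is false.
inline int first_bad_layer(const std::vector<double>& layer_cosine, double threshold) {
    for (std::size_t l = 0; l < layer_cosine.size(); ++l) {
        if (!(layer_cosine[l] >= threshold)) return static_cast<int>(l);
    }
    return -1;
}
```

For example, with 43 per-layer cosines and an assumed threshold of 0.999, a value of 0.921 at index 2 would make the function return 2, while a clean run returns -1.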
Same-process multi-prompt smoke (max_new=16, temp=0 greedy):
- 5/5 prompts complete without OOM/kill, single-process
- 3/5 intelligible
Example outputs (greedy decode):
- "The capital of France is" → "Paris. It is one of the most famous cities in the world. It is"
- "Ciao, come stai?" → "Spero tutto bene. Oggi voglio parlarti di un argo..."
- "Hello" → coherent English template-style continuation (ERB syntax)
Bugs fixed during bring-up
A few non-trivial issues had to be resolved to get from token noise to coherent output:
1. HC residual broadcast. The token embedding initially populated only HC slot 0 of the [N_HC=4, HIDDEN] residual stream. The other 3 slots were garbage, which propagated into the HyperConnection mixing block and produced orthogonal hidden states from layer 0 onward. Fixed by broadcasting embed slot 0 across all HC slots.
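A minimal sketch of the broadcast in fix 1, assuming a row-major [N_HC, HIDDEN] layout on the host. The function name is hypothetical, and the real runtime presumably does this on the device.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Build the initial [n_hc, hidden] residual stream (row-major) by copying the
// token embedding into every HC slot, so no slot is left uninitialized.
inline std::vector<float> init_hc_residual(const std::vector<float>& embed,
                                           std::size_t n_hc) {
    const std::size_t hidden = embed.size();
    std::vector<float> residual(n_hc * hidden);
    for (std::size_t slot = 0; slot < n_hc; ++slot) {
        std::copy(embed.begin(), embed.end(),
                  residual.begin() + static_cast<std::ptrdiff_t>(slot * hidden));
    }
    return residual;
}
```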
2. MoE W2/W3 ordering. The C++ enum mapping {W1=0, W3=1, W2=2} did not match the Python loader’s registration order (w1, w2, w3). The routed expert output was nearly orthogonal (cos ≈ 0) before the fix. Re-ordering to {W1=0, W2=1, W3=2} was a one-line fix that brought routed expert cosine to 0.999989 and step-0 argmax to the correct token.
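One way to keep this class of bug from recurring is a startup check that the enum values agree with the loader's registration order. The sketch below is an assumption about how such a guard could look, not code from the runtime; `kLoaderOrder`, `loader_index`, and `enum_matches_loader_order` are hypothetical names.

```cpp
#include <cstring>

// Enum values after the fix described above.
enum ExpertWeight { W1 = 0, W2 = 1, W3 = 2, kNumExpertWeights = 3 };

// Registration order on the Python loader side, per the post: (w1, w2, w3).
static const char* const kLoaderOrder[kNumExpertWeights] = {"w1", "w2", "w3"};

// Position of a weight name in the loader's order, or -1 if unknown.
inline int loader_index(const char* name) {
    for (int i = 0; i < kNumExpertWeights; ++i) {
        if (std::strcmp(kLoaderOrder[i], name) == 0) return i;
    }
    return -1;
}

// True when every enum value equals its position in the loader's order.
inline bool enum_matches_loader_order() {
    return loader_index("w1") == W1 &&
           loader_index("w2") == W2 &&
           loader_index("w3") == W3;
}
```

Calling `enum_matches_loader_order()` once at init turns a silent cos ≈ 0 failure into an immediate, explicit error.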
3. YARN / compressed RoPE. The C++ code initially divided all compressed RoPE dimensions uniformly by the YARN factor. The DeepSeek reference instead uses a piecewise NTK ramp:
freqs = freqs/factor*(1-smooth) + freqs*smooth
where smooth = 1 - linear_ramp(low, high)
with low=15, high=25 for our config. After porting the exact ramp 1:1 from the Python reference, L2 q_post_rope cosine went from 0.921 to 0.999982 vs the reference, and downstream layers stopped diverging.
4. mmap / hot-set memory pressure. Same-process smoke originally died around prompt 3. Root cause was routed-expert cold-loads touching mmap-backed safetensors pages and creating page-cache + swap pressure (≈15 GB of routed pages touched per prompt at top-32 hot-set). Fixed with madvise(MADV_DONTNEED) on interior pages after H2D copy completion. Same-process smoke now plateaus at MemAvailable ≈ 33–40 GB and SwapFree stable across all 5 prompts.
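The "interior pages" part of fix 4 can be sketched as below: only pages that lie entirely inside a tensor's byte range are released, so a boundary page possibly shared with a neighboring tensor is left alone. That rationale is an assumption on my part, and `PageRange`, `interior_pages`, and `release_interior_pages` are hypothetical names. `madvise`, `MADV_DONTNEED`, and `sysconf(_SC_PAGESIZE)` are the standard Linux/POSIX calls.

```cpp
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>

// Half-open, page-aligned address range [begin, end).
struct PageRange { std::uintptr_t begin; std::uintptr_t end; };

// Largest page-aligned range fully contained in [addr, addr + len).
// Returns an empty range (begin == end) if no whole page fits.
inline PageRange interior_pages(std::uintptr_t addr, std::size_t len, std::size_t page) {
    const std::uintptr_t first = (addr + page - 1) / page * page;  // round start up
    const std::uintptr_t last  = (addr + len) / page * page;       // round end down
    return first < last ? PageRange{first, last} : PageRange{first, first};
}

// Release the interior pages of a tensor's mmap span; intended to be called
// only after the H2D copy that reads from the span has completed.
// Returns 0 on success or when there is nothing to release, -1 on failure.
inline int release_interior_pages(const void* p, std::size_t len) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const PageRange r = interior_pages(reinterpret_cast<std::uintptr_t>(p), len, page);
    if (r.begin == r.end) return 0;
    return madvise(reinterpret_cast<void*>(r.begin), r.end - r.begin, MADV_DONTNEED);
}
```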
5. Cold-load async infrastructure. Added an event-based async cold-load path: per-slot cudaEvent_t, cudaStreamWaitEvent from the consumer stream onto the prefetch stream, and cudaLaunchHostFunc for the deferred madvise so the decode thread never blocks on copy completion. Then collapsed the per-expert event/wait/host-function fan-out down to one event + one wait + one host-function per pack call (top-K = 6 experts batched together).
Outputs remained bit-identical across these infrastructure changes.
Current performance snapshot
This is still an active optimization target, not a final benchmark. Cumulative effect of the async + batched cold-load infrastructure (D27 same-process baseline, top-32, max_new=16, 5 greedy prompts):
- prompts_total wall: −6.1 %
- effective decode TPS (excl. init/prewarm): +4.7 %
- end-to-end TPS (incl. init/prewarm): +5.7 %
Throughput is currently decode-bandwidth-bound by routed-expert cold-loads from mmap, not by compute. Total cold-load miss volume is unchanged across these changes — they reduce the per-miss coordination cost, not the H2D bytes.
Empirical observations from a per-layer miss histogram (5 prompts × 16 decode tokens, top-32 prewarm):
- 6,110 routed-expert cold-loads over the run
- top-10 layers cover only 28.8 % of misses (vs 23 % uniform, 50 % concentrated)
- spread max/min ≈ 2.7× (most-missed L1: 257 misses; least-missed L16: 96)
- routing is dispersed both cross-layer within a prompt and dynamically across prompts
This means scaling the static prewarm does not help further (verified: top-64 prewarm raises memory cost by 18 GB and prewarm time by 60 % while reducing miss count by only 1.5 %). The next perf gain has to come from prefetch/overlap, not from a larger static hot-set.
Current direction:
- keep top-32 default
- avoid larger static prewarm
- investigate true prefetch/overlap and predictive prefetch for routed-expert misses
- no MTP, no CUDA Graph yet
- math/kernel code frozen; every infrastructure change is gated on the forced-token 8/8 argmax match
Limitations
- No public release package yet
- No OpenAI-compatible server yet
- Performance numbers above are deltas, not steady-state; absolute TPS is not yet competitive
- No CUDA Graph
- No MTP / speculative decode
- No request batching
- Hot-set routing policy is still LRU + static prewarm
- Model weights are not included
Why I’m posting
I mainly wanted to confirm publicly that DeepSeek V4 Flash MXFP4 can be made to run correctly on a single GB10/GX10-class node, at least as an experimental runtime, and to compare notes with anyone else working on DSv4 / Blackwell sm_121a / unified-memory inference.
Happy to share more details on the failure modes, the forced-token bisect harness, and the cold-load lifecycle work if useful.
Update: the sanitized experimental repo is now public, with v0.1.0 tagged as a prerelease:
It includes the C++ runtime, build instructions, and the validation harnesses used for the GB10 proof-of-life. The release notes include the validation gates, bring-up history, known limitations, and validation provenance.
Questions for the community
- Has anyone else validated DeepSeek V4 Flash on GB10/GX10 with CUDA 13.x?
- Are there recommended patterns for mmap-backed expert cold-loads on unified-memory systems?
- Any best practices for avoiding page-cache / swap pressure when staging large MoE expert banks (15+ GB of routed pages per prompt)?
- Is anyone already working on vLLM / SGLang support for the DSv4 Flash path on sm_121a (tracker: vLLM #41063)?
Thanks.