DeepSeek V4 Flash MXFP4 proof-of-life on a single GB10/GX10

Hi all,

I wanted to share an early technical proof-of-life for running DeepSeek V4 Flash (43L, MoE, MLA) in MXFP4 on a single NVIDIA GB10/GX10-class system with 128 GB unified memory.

This is not a polished release and not a vLLM replacement. It is a custom experimental C++ runtime focused on validating whether the model can run correctly end-to-end on a single-node GB10 setup, with no sharding, no pipeline parallelism, and no tensor parallelism — all 43 layers, 256 routed experts per layer, KV cache, and workspace resident in unified memory.

Hardware / Runtime

  • NVIDIA GB10 / GX10-class system, single node

  • 128 GB unified memory (CPU+GPU)

  • CUDA 13.x, sm_121a

  • Custom C++ MXFP4 runtime (no vLLM, no llama.cpp)

  • 43-layer DeepSeek V4 Flash forward path

  • Routed MoE experts via hot-set cache with mmap-backed cold storage

  • Multi-head Latent Attention (MLA)

  • YARN / compressed RoPE

  • Same-process multi-prompt smoke test

  • Weights pulled from the public HuggingFace DSv4 Flash safetensors snapshot

Current correctness status

The runtime now passes the main correctness and lifecycle gates:

Forced-token R2-vs-C++ equivalence harness. Each decode step, the C++ engine is fed the previous token from the Python reference forward pass (R2), not its own output, and the predicted argmax + full logits are compared:

  • 8/8 forced-token argmax match against R2

  • logits cosine: 0.9994–0.9999, max abs error 0.6–1.1
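A minimal sketch of the per-step comparison behind these numbers, assuming both sides expose their logits as plain float vectors. The names `StepReport`, `argmax`, and `compare_step` are hypothetical and are not taken from the runtime.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

// Per-step comparison of reference logits vs. runtime logits.
struct StepReport {
    bool   argmax_match;
    double cosine;
    double max_abs_err;
};

// Index of the largest logit; the first maximum wins on ties.
std::size_t argmax(const std::vector<float>& v) {
    assert(!v.empty());
    return static_cast<std::size_t>(
        std::distance(v.begin(), std::max_element(v.begin(), v.end())));
}

// Accumulates in double to keep rounding noise out of the cosine.
// An all-zero input yields NaN for the cosine, which callers should treat as a failure.
StepReport compare_step(const std::vector<float>& ref,
                        const std::vector<float>& got) {
    assert(ref.size() == got.size() && !ref.empty());
    double dot = 0.0, nr = 0.0, ng = 0.0, worst = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double r = ref[i];
        const double g = got[i];
        dot += r * g;
        nr  += r * r;
        ng  += g * g;
        worst = std::max(worst, std::fabs(r - g));
    }
    return {argmax(ref) == argmax(got), dot / std::sqrt(nr * ng), worst};
}
```

Because the next input token always comes from the reference, a single early mismatch cannot cascade, so each step's report can be read on its own.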

Per-layer R2-vs-C++ cosine bisect. Each intermediate tensor (HC pre/post, RMS, MLA, MoE, post-FFN) compared layer-by-layer:

  • No first_bad_layer found in 0..42 with all bug fixes applied (HC broadcast, W2/W3 enum, YARN ramp)
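The bisect reduces to scanning per-layer cosines for the first one below a threshold. A rough sketch follows, where the function signature and the 0.999 threshold used in the usage note are assumptions for illustration only.

```cpp
#include <cstddef>
#include <vector>

// Returns the index of the first layer whose cosine against the reference
// falls below `threshold`, or -1 if every layer passes.
// The negated comparison also flags NaN cosines as bad.
int first_bad_layer(const std::vector<double>& layer_cosine, double threshold) {
    for (std::size_t i = 0; i < layer_cosine.size(); ++i) {
        if (!(layer_cosine[i] >= threshold)) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

With 43 cosines and a threshold of 0.999, a value such as the pre-fix 0.921 at layer 2 would be reported as `first_bad_layer == 2`, while a clean run returns -1.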

Same-process multi-prompt smoke (max_new=16, temp=0 greedy):

  • 5/5 prompts complete without OOM/kill, single-process

  • 3/5 intelligible

Example outputs (greedy decode):

  • The capital of France is → Paris. It is one of the most famous cities in the world. It is

  • Ciao, come stai? → Spero tutto bene. Oggi voglio parlarti di un argo...

  • Hello → coherent English template-style continuation (ERB syntax)

Bugs fixed during bring-up

A few non-trivial issues had to be resolved to get from token noise to coherent output:

1. HC residual broadcast. The token embedding initially populated only HC slot 0 of the [N_HC=4, HIDDEN] residual stream. The other 3 slots were garbage, which propagated into the HyperConnection mixing block and produced orthogonal hidden states from layer 0 onward. Fixed by broadcasting embed slot 0 across all HC slots.
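As an illustration of the fix, here is a small sketch that fills every HC slot from the embedding. It assumes a slot-major, row-major [N_HC, HIDDEN] layout; `broadcast_embedding` is a hypothetical name, not the runtime's.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t N_HC = 4;  // number of HyperConnection slots, from the post

// Builds the [N_HC, hidden] residual stream by copying the token embedding
// into every slot, so that no slot is left uninitialized.
std::vector<float> broadcast_embedding(const std::vector<float>& embed) {
    const std::size_t hidden = embed.size();
    std::vector<float> residual(N_HC * hidden);
    for (std::size_t slot = 0; slot < N_HC; ++slot) {
        std::copy(embed.begin(), embed.end(), residual.begin() + slot * hidden);
    }
    return residual;
}
```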

2. MoE W2/W3 ordering. The C++ enum mapping {W1=0, W3=1, W2=2} did not match the Python loader’s registration order (w1, w2, w3). The routed expert output was nearly orthogonal (cos ≈ 0) before the fix. Re-ordering to {W1=0, W2=1, W3=2} was a one-line fix that brought routed expert cosine to 0.999989 and step-0 argmax to the correct token.
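One way to catch this class of bug early is a startup check that the enum values index the same slots the loader fills. The sketch below is illustrative: `ExpertWeight`, `kLoaderOrder`, and `enum_matches_loader` are assumed names, and the loader order is the one stated above.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Corrected mapping from the post: {W1=0, W2=1, W3=2}.
enum class ExpertWeight : int { W1 = 0, W2 = 1, W3 = 2 };

// Loader registration order as described in the post: (w1, w2, w3).
const std::array<std::string, 3> kLoaderOrder = {{"w1", "w2", "w3"}};

// Tensor name each enum value is meant to address.
std::string name_of(ExpertWeight w) {
    switch (w) {
        case ExpertWeight::W1: return "w1";
        case ExpertWeight::W2: return "w2";
        case ExpertWeight::W3: return "w3";
    }
    return "";
}

// True when every enum value indexes the slot that holds its own tensor.
bool enum_matches_loader() {
    const ExpertWeight all[] = {ExpertWeight::W1, ExpertWeight::W2,
                                ExpertWeight::W3};
    for (ExpertWeight w : all) {
        if (kLoaderOrder[static_cast<std::size_t>(w)] != name_of(w)) {
            return false;
        }
    }
    return true;
}
```

With the original {W1=0, W3=1, W2=2} mapping, the same check would fail on W2 and W3, which is cheaper to discover at load time than through a near-zero cosine.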

3. YARN / compressed RoPE. The C++ code initially divided all compressed RoPE dimensions uniformly by the YARN factor. The DeepSeek reference uses a piecewise NTK ramp:

freqs = freqs/factor*(1-smooth) + freqs*smooth
where smooth = 1 - linear_ramp(low, high)

with low=15, high=25 for our config. After porting the exact ramp 1:1 from the Python reference, L2 q_post_rope cosine went from 0.921 to 0.999982 vs the reference, and downstream layers stopped diverging.
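A compact sketch of that ramp, written directly from the two-line formula above. The function names are hypothetical, and `rope_dim`, `base`, and `factor` are left as parameters because the post only states low=15 and high=25.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Linear ramp clamped to [0, 1]: 0 at or below `low`, 1 at or above `high`.
double linear_ramp(std::size_t i, double low, double high) {
    if (low == high) high += 0.001;  // avoid dividing by zero
    const double t = (static_cast<double>(i) - low) / (high - low);
    return std::min(1.0, std::max(0.0, t));
}

// Inverse RoPE frequencies with the piecewise YARN correction:
//   freqs = freqs / factor * (1 - smooth) + freqs * smooth
//   smooth = 1 - linear_ramp(low, high)
std::vector<double> yarn_inv_freqs(std::size_t rope_dim, double base,
                                   double factor, double low, double high) {
    const std::size_t half = rope_dim / 2;
    std::vector<double> freqs(half);
    for (std::size_t i = 0; i < half; ++i) {
        const double f = 1.0 / std::pow(base, 2.0 * i / rope_dim);
        const double smooth = 1.0 - linear_ramp(i, low, high);
        freqs[i] = f / factor * (1.0 - smooth) + f * smooth;
    }
    return freqs;
}
```

Frequency indices at or below `low` stay unscaled, indices at or above `high` are divided by the full factor, and the band in between is blended. Dividing every index by the factor, as the buggy version did, only matches this on the upper band.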

4. mmap / hot-set memory pressure. Same-process smoke originally died around prompt 3. Root cause was routed-expert cold-loads touching mmap-backed safetensors pages and creating page-cache + swap pressure (≈15 GB of routed pages touched per prompt at top-32 hot-set). Fixed with madvise(MADV_DONTNEED) on interior pages after H2D copy completion. Same-process smoke now plateaus at MemAvailable ≈ 33–40 GB, with SwapFree stable across all 5 prompts.
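A sketch of the interior-page computation, assuming Linux and a tensor that sits at an arbitrary byte offset inside the mapping. `PageRange`, `interior_pages`, and `drop_interior_pages` are illustrative names. Restricting the call to whole pages inside the tensor is my reading of "interior pages": the partial pages at either edge may also hold bytes of neighbouring tensors.

```cpp
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>

struct PageRange {
    std::uintptr_t begin;   // first page-aligned address inside the tensor
    std::size_t    length;  // multiple of the page size, possibly 0
};

// Largest page-aligned sub-range fully contained in [addr, addr + len).
PageRange interior_pages(std::uintptr_t addr, std::size_t len, std::size_t page) {
    const std::uintptr_t first = (addr + page - 1) / page * page;  // round up
    const std::uintptr_t last  = (addr + len) / page * page;       // round down
    if (last <= first) return {first, 0};
    return {first, static_cast<std::size_t>(last - first)};
}

// Call only after the H2D copy that read from `tensor` has completed.
// Returns true if there was nothing to drop or madvise succeeded.
bool drop_interior_pages(const void* tensor, std::size_t len) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const PageRange r =
        interior_pages(reinterpret_cast<std::uintptr_t>(tensor), len, page);
    if (r.length == 0) return true;
    return madvise(reinterpret_cast<void*>(r.begin), r.length, MADV_DONTNEED) == 0;
}
```

On a read-only file-backed mapping the dropped pages are simply re-read from the file on the next touch, so a later miss on the same expert stays correct and only costs another page-in.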

5. Cold-load async infrastructure. Added an event-based async cold-load path: per-slot cudaEvent_t, cudaStreamWaitEvent from the consumer stream onto the prefetch stream, and cudaLaunchHostFunc for the deferred madvise so the decode thread never blocks on copy completion. Then collapsed the per-expert event/wait/host-function fan-out down to one event + one wait + one host-function per pack call (top-K = 6 experts batched together).

Outputs remained bit-identical across these infrastructure changes.

Current performance snapshot

This is still an active optimization target, not a final benchmark. Cumulative effect of the async + batched cold-load infrastructure (D27 same-process baseline, top-32, max_new=16, 5 greedy prompts):

  • prompts_total wall: −6.1 %

  • effective decode TPS (excl. init/prewarm): +4.7 %

  • end-to-end TPS (incl. init/prewarm): +5.7 %

Throughput is currently decode-bandwidth-bound by routed-expert cold-loads from mmap, not by compute. Total cold-load miss volume is unchanged across these changes — they reduce the per-miss coordination cost, not the H2D bytes.

Empirical observations from a per-layer miss histogram (5 prompts × 16 decode tokens, top-32 prewarm):

  • 6,110 routed-expert cold-loads over the run

  • top-10 layers cover only 28.8 % of misses (≈ 23 % would be perfectly uniform over 43 layers; 50 % would indicate strong concentration)

  • spread max/min ≈ 2.7× (most-missed L1: 257 misses; least-missed L16: 96)

  • routing is dispersed both cross-layer within a prompt and dynamically across prompts

This means static prewarm scaling does not help further (verified: top-64 prewarm raises memory cost by 18 GB and prewarm time by 60 % while reducing miss count by only 1.5 %). The next perf gain has to come from prefetch/overlap, not from a larger static hot-set.
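For reference, the two concentration figures can be derived from a per-layer miss histogram as sketched below. The helper names are hypothetical, and the histogram itself is not reproduced here.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Fraction of all misses that fall in the k most-missed layers.
// Takes the histogram by value because it is partially sorted in place.
double top_k_share(std::vector<long> misses, std::size_t k) {
    k = std::min(k, misses.size());
    const long total = std::accumulate(misses.begin(), misses.end(), 0L);
    if (total == 0) return 0.0;
    std::partial_sort(misses.begin(), misses.begin() + k, misses.end(),
                      std::greater<long>());
    const long top = std::accumulate(misses.begin(), misses.begin() + k, 0L);
    return static_cast<double>(top) / static_cast<double>(total);
}

// Ratio of the most-missed to the least-missed layer (0 if undefined).
double max_min_spread(const std::vector<long>& misses) {
    if (misses.empty()) return 0.0;
    const auto mm = std::minmax_element(misses.begin(), misses.end());
    if (*mm.first <= 0) return 0.0;
    return static_cast<double>(*mm.second) / static_cast<double>(*mm.first);
}
```

A perfectly flat histogram over 43 layers gives a top-10 share of 10/43 ≈ 23 %, which is the uniform baseline quoted above, and 257/96 gives the ≈ 2.7× spread.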

Current direction:

  • keep top-32 default

  • avoid larger static prewarm

  • investigate true prefetch/overlap and predictive prefetch for routed-expert misses

  • no MTP, no CUDA Graph yet

  • math/kernel code frozen — every infrastructure change is gated on the forced-token 8/8 argmax match

Limitations

  • No public release package yet

  • No OpenAI-compatible server yet

  • Performance numbers above are deltas, not steady-state; absolute TPS is not yet competitive

  • No CUDA Graph

  • No MTP / speculative decode

  • No request batching

  • Hot-set routing policy is still LRU + static prewarm

  • Model weights are not included

Why I’m posting

I mainly wanted to confirm publicly that DeepSeek V4 Flash MXFP4 can be made to run correctly on a single GB10/GX10-class node, at least as an experimental runtime, and to compare notes with anyone else working on DSv4 / Blackwell sm_121a / unified-memory inference.

Happy to share more details on the failure modes, the forced-token bisect harness, and the cold-load lifecycle work if useful.

Update: the sanitized experimental repo is now public, with v0.1.0 tagged as a prerelease:

It includes the C++ runtime, build instructions, and the validation harnesses used for the GB10 proof-of-life. The release notes include the validation gates, bring-up history, known limitations, and validation provenance.

Questions for the community

  1. Has anyone else validated DeepSeek V4 Flash on GB10/GX10 with CUDA 13.x?

  2. Are there recommended patterns for mmap-backed expert cold-loads on unified-memory systems?

  3. Any best practices for avoiding page-cache / swap pressure when staging large MoE expert banks (15+ GB of routed pages per prompt)?

  4. Is anyone already working on vLLM / SGLang support for the DSv4 Flash path on sm_121a (tracker: vLLM #41063)?

Thanks.

You pay 3000 dollars for a GX10.

If you get it used, let’s say you save about 500.

A day of intense work, 24 hours in which the AI does not stop, consumes about 10 dollars.

10 dollars goes 250 times into 2500

In April-May 2027, let’s not forget that we will be able to run the equivalent of GPT 6 locally without problems.

Which choice is better: investing the 2500 directly in cash, or giving it to the AI to trade with so that it pays for itself?

If you buy the board, pay in installments: put down as little money as possible up front so the rest can keep multiplying.

Here is another view of token costs. We have employees working 8-9 hour shifts, each burning through $200+ a day in tokens… That is when it makes sense to run your LLMs locally.

Even on the most stable forks performance is not usable yet. RTX 6000 Pro users are seeing massive prefill slowdowns.

The prefill bottleneck on DSv4 Flash is structural, not hardware-specific.
The prefill path has to cover 43 transformer layers × 256 routed experts
(top-K=6), plus the 64-head indexer with top_k=512. Even on an RTX 6000
Pro you pay for cold-loading expert weights and for the indexer scan once
per request.

Since the v0.1.0 push I have added an on-disk KV cache layer to the
engine, running on a single GB10:

  • Format: 48-byte header + raw BF16 layer-major payload, sized as
    n_layers × seq_len × kv_lora_rank × sizeof(bf16).
  • Cache key: SHA1 over (model id + tokenizer hash + prompt token IDs +
    runtime/config version). Any change in tokenizer, weights, or runtime
    config invalidates entries automatically.
  • Small JSON sidecar carrying the first generated token, since the KV
    snapshot doesn’t include the last-prompt-token logits. Without it you’d
    pay an extra forward pass on hit.
  • Two modes: exact-prompt cache (full skip on identical re-request) and
    stable-prefix cache (load prefix KV, prefill only the user-turn suffix
    on a multi-message chat).
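A small sketch of the size arithmetic implied by the format bullet, useful as a sanity check before trusting a cache file. The function names are hypothetical, and the `kv_lora_rank` of 512 in the usage note is an illustrative value, not one stated in this thread.

```cpp
#include <cstdint>

constexpr std::uint64_t kHeaderBytes = 48;  // fixed header size, from the post
constexpr std::uint64_t kBf16Bytes   = 2;   // sizeof(bf16)

// Payload size: n_layers x seq_len x kv_lora_rank x sizeof(bf16).
constexpr std::uint64_t kv_payload_bytes(std::uint64_t n_layers,
                                         std::uint64_t seq_len,
                                         std::uint64_t kv_lora_rank) {
    return n_layers * seq_len * kv_lora_rank * kBf16Bytes;
}

// Expected on-disk size of one cache entry.
constexpr std::uint64_t kv_file_bytes(std::uint64_t n_layers,
                                      std::uint64_t seq_len,
                                      std::uint64_t kv_lora_rank) {
    return kHeaderBytes + kv_payload_bytes(n_layers, seq_len, kv_lora_rank);
}

// Rejects truncated or mismatched files before any KV data is loaded.
constexpr bool size_matches(std::uint64_t actual_file_bytes,
                            std::uint64_t n_layers,
                            std::uint64_t seq_len,
                            std::uint64_t kv_lora_rank) {
    return actual_file_bytes == kv_file_bytes(n_layers, seq_len, kv_lora_rank);
}
```

For example, 43 layers, a 5-token prompt, and an assumed rank of 512 would give 48 + 43 × 5 × 512 × 2 = 220,208 bytes.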

Measured on GB10, prompt “The capital of France is”:

Request   Hit kind   prefill_s   decode_s
first     miss          168.78      32.20
second    exact           0.00       1.25

For multi-message chat with reused system/tool spec but a different user
turn, the prefix-cache path measured ~8 s of suffix-only prefill against
~170 s full prefill on the same engine.

Caveats: this doesn’t change steady-state decode tok/s, and it doesn’t
reduce the first-request prefill cost. It only kills repeat cost. For
Claude-Code-style or agent loops where ~25k tokens of system prompt get
resent every turn, that’s typically the dominant practical cost.

The pattern is additive: no kernel/math/quantization changes, so it
should port cleanly to llama.cpp / vLLM forks. The non-obvious bit is
the BPE-boundary clip when splitting a multi-message prompt into a
stable prefix: you need to clip the prefix tokenization to the longest
common token prefix against the full-prompt tokenization, otherwise a
mid-token split corrupts the cache.
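The clip itself is short. A sketch follows, assuming token IDs are plain ints and that both tokenizations come from the same tokenizer; the function names are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Length of the longest common token prefix of two tokenizations.
std::size_t common_token_prefix(const std::vector<int>& prefix_ids,
                                const std::vector<int>& full_ids) {
    const std::size_t n = std::min(prefix_ids.size(), full_ids.size());
    std::size_t i = 0;
    while (i < n && prefix_ids[i] == full_ids[i]) ++i;
    return i;
}

// Keeps only the prefix tokens that also start the full-prompt tokenization.
// Tokens past that point were merged differently once the suffix was
// appended, so their KV entries must not be cached or reused.
std::vector<int> clip_stable_prefix(const std::vector<int>& prefix_ids,
                                    const std::vector<int>& full_ids) {
    const std::size_t keep = common_token_prefix(prefix_ids, full_ids);
    return std::vector<int>(prefix_ids.begin(), prefix_ids.begin() + keep);
}
```

If the prefix tokenizes to {1, 2, 3, 9} on its own but the full prompt starts {1, 2, 3, 7, 8}, only {1, 2, 3} is safe to cache, and prefill resumes from token index 3.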

Credit where due: the on-disk KV pattern is inspired by antirez/ds4 on
Apple Metal, which uses an analogous SHA1-keyed file cache. I
implemented the equivalent on the CUDA / Blackwell path.

This doesn’t fix the underlying prefill memory-bandwidth bound, which is
a separate harder problem still open in my own runs. But for the “every
request feels slow because we re-prefill the system prompt” pattern, it
closes most of it.