DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers

0rand · June 3, 2026, 4:31pm

You probably answered your question yourself - most likely not need multimodal and work purely as coding/tool call/terminal agent and processing data - 1M tokens of DeepSeek (potential, likely nobody yet achieved actual 1M per session capability on sparks) is very attractive. I honestly don’t know how to work with 256k. My session fills over 256k in one hour. Compaction leads to loss of fidelity, you spend a lot of time explaining it again. In the end it fills session again. Unless you one-shot vibe code, large context is vital.
Just IMO

CosmicRaisins · June 3, 2026, 4:53pm

I’m using Pi for my local mode too. I’ll check out dsv4 in codex when I have a chance next time! I’ve been using Opus for a while now. While I usually stay below 300k ctx, I have never had it halluciate implementation details every since 4.5. Hallucination hasn’t been a problem for me when it comes to coding in any models above 100B in size. Sometimes when a task is badly scoped and vaguely described Claude would pull data from its knowledge based rather than doing web searches but that’s more of a user error from my part. (e.g. Make a plan to build xx stack on my home server and run xxx model).

tonyd615 · June 3, 2026, 5:00pm

why not give it a small local model so it can see ? GitHub - stevibe/local-llm-video-captioning · GitHub

CosmicRaisins · June 3, 2026, 5:03pm

Thanks for the input but in what way did I answer my question? In my experience, DSV4 takes more work to set up, runs at a similar speed as its peers, and isn’t as smart. Is it just a snowball effect? More attention → more adoption → more attention?

Most probably don’t need 1M context with local models too. Most recipes I’ve seen people sharing with dsv4 flash uses 262k ctx length, not 1M. And Mimo v2.5 also supports 1M ctx, and is noticably more intelligent with the same quant.

I have a large codebase but I try to make individual files single task and keep a updated project structure file in memory. I rarely reach 262k ctx doing implementation and exploration work with local model, and had never reached 1M with frontier even with lazy prompting.

I’m not saying DSV4 Flash is incapable in any sense, it’s just interesting to me that so many people are willing to commit a lot of effort to this model to make it work slightly better vs running another model.

0rand · June 3, 2026, 5:21pm

New shiny object? I tried it, got it running half-decent. Went back to Qwen 3.5 122b, which is faster, handles same 512k, and scores higher on my bench. DS4F is cool, but by no means it’s insane. If anything, it is being extremely cheap in cloud so it works as a backup for a local model very well. And cloud inference definitely can handle over 500k - tested. Even though it scores less than local setup (higher quantization likely).

ekkis · June 3, 2026, 7:07pm

Last time I tried Mimo 2.5 it was basically unusable past 100k context so I moved on. Minimax M2.7 was very good at times but also struggled a lot once you got to 150k or so context. DSv4 Flash is faster than Mimo and at least as fast if not faster than Minimax, and has the benefit of staying perfectly lucid at least to 300k context and probably beyond. It stays on point and can follow through on detailed plans in a way I rarely saw with Minimax, so despite being less intelligent on paper it feels much more useable imo.

tonyd615 · June 3, 2026, 7:13pm

agreed, and it knows how to properly tool call

eb.spark · June 3, 2026, 7:34pm

From my perspective DS4 Flash on 2x Spark has going for it:

We can use the original weights to reproduce the results of the API, there is no concern that something went wrong during quantization, NVFP4 vs INT4 etc.
With 1M context, it occupies 102G per Spark. That leaves enough room for other things to run in parallel like image gen or a TTS/STT stack, it’s a very handy size
The latest recipes posted in this thread are remarkable in how little pp or tg fall off at large context sizes. I have been using it in pi coding agent at ~200k context for hours now, and it feels quite similar to a fresh context. No other models I have tried showed that little degradation of throughput at such a large context. Concurrency also works nicely
I believe deepseek has announced that the vision input capable update of DS4 Flash will be released some weeks after the current weights, so I expect image input to come
In terms of quality of results, I have been quite happy, but I have not tried to many alternatives so far (recent spark owner here)

0rand · June 3, 2026, 7:50pm

I agree with these statements, but with a caveat - while we have plenty of RAM to have multiple 1M sessions (my setup had like 6M cache), practicality place huge constraint onto it - all depends on pp speed, if it will crawl at 100 t/s pp it will take hours to get any response, as good as have no cache. I observed, that a very large context suffers disproportionately by having to push context back and forth through fabric/qspf56 link. Large context is better on a single spark, if it can handle the model. if only we could get a nemotron-like context handling with deepseek or qwen type of intelligence. But quite likely it is connection - few attention heads in nemotrons, very small tokens, easy to shove around..

11_p · June 3, 2026, 9:19pm

DeepSeek-V4-Flash (official FP8) on 4× DGX Spark — TP=4, 500K ctx, b12x, ~70 tok/s single-stream + concurrency results

Just got official DeepSeek-V4-Flash running on 4× DGX Spark (GB10) at TP=4 with 500K context using aidendle94’s b12x-optimized vLLM fork. Sharing full numbers since I hadn’t seen a confirmed 4-node benchmark post yet.

Hardware

4× DGX Spark (GB10), 128GB unified memory each
MikroTik CRS812 switch, each node 2×200G RoCE (400G ports broken out)
RoCE / NCCL over CX-7 NICs

Software

Image: aidendle94/sparkrun-vllm-ds4-gb10:production-ready (b12x branch)
vLLM: v0.21.1rc1.dev339+g1967a5627bc3

Key launch flags

--tensor-parallel-size 4
--max-model-len 500000
--max-num-seqs 8
--block-size 256
--gpu-memory-utilization 0.8
--kv-cache-dtype fp8
--distributed-executor-backend mp
--compilation-config {"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}
--speculative-config {"method":"mtp","num_speculative_tokens":2}
--enable-flashinfer-autotune
--enable-prefix-caching
--enable-chunked-prefill

Environment: NCCL_NET=IB, NCCL_IB_DISABLE=0, NCCL_IB_GID_INDEX=0, VLLM_USE_B12X_MOE=1.

Single-stream decode (short prompt, 200 output tokens, server-side)

~70 tok/s sustained with 83% MTP draft acceptance rate (mean acceptance length 2.66).

Prefill throughput (single request, varying context)

Context	Prompt tokens	Time	Prefill tok/s	TTFT
4K	705	7.05s	100*	7.05s
32K	5,505	6.81s	809*	6.81s
128K	22,005	11.05s	1,991	11.05s
256K	44,005	13.25s	3,322	13.25s
500K	85,005	21.25s	4,000	21.25s

*Short contexts are network-latency-dominated from client-side measurement; server-side prefill is faster.

Decode concurrency (200 output tokens per request, client-side measured)

C	Total output	Time	Aggregate tok/s	Per-request avg
1	200	8.70s	23.0	23.0
2	400	10.11s	39.6	19.8
4	800	12.79s	62.5	15.6
8	1,600	13.55s	118.1	14.8

Prefill concurrency (128K context, 5 output tokens, client-side measured)

C	Total prompt	Time	Aggregate tok/s	Avg TTFT
1	22K	5.35s	4,111	5.35s
2	44K	5.58s	7,894	2.79s
4	88K	5.83s	15,106	1.46s

KV cache capacity:
5.3M tokens (~10 concurrent 500K requests safely)

Cold start time (first run):

Model load: ~160s
DeepGEMM warmup: ~2min
TileLang + FlashInfer autotune: ~37s
Total: ~3-4 min (subsequent starts ~40s with cache)

Comparison to known 2-node results:

The 2-node recipe showed ~42 tok/s single-stream decode and ~2000 tok/s prefill (short ctx). On 4 nodes we see:

Decode: ~70 tok/s (1.66× improvement, expected sub-linear due to MoE all-to-all overhead)
Prefill 128K: ~1991 tok/s (matched at equal context)
Prefill 500K: 4000 tok/s (long-context prefills benefit from more TP shards)
Concurrency scaling is near-linear for prefill up to C=4 (15K aggregate)

Gotchas encountered:

GID index on RoCE: NCCL defaulted to GID index 3 which was empty. Fix: NCCL_IB_GID_INDEX=0.
NCCL_NET=IB required: Without it, pip-distributed NCCL won’t use RoCE, causing ibv_modify_qp failures.
Missing /workspace: aidendle94’s image doesn’t have the WORKDIR that the launch script expects. Added docker exec mkdir -p /workspace.
Persistent GID table changed WARN: RoCE interface generates netlink events during runtime. Does not affect functionality; suppress with sysctl net.ipv6.conf.roce*.accept_ra=0 if desired.

Summary:

aidendle94’s b12x fork on 4× DGX Spark delivers solid performance: ~70 tok/s single-stream decode, 4000 tok/s prefill at 500K context, and near-linear prefill concurrency scaling up to 15K tok/s aggregate with sub-1.5s TTFT. The limiting factor is clearly MoE all-to-all cross-node communication, not compute.

tonyd615 · June 4, 2026, 12:26am

How are you running a 4x cluster I’m thinking about it down the road

11_p · June 4, 2026, 1:11am

I’m using a MikroTik CRS812 switch with 400G ports broken out to 2×200G per node. The rest is essentially the same recipe as yours but with 4 nodes, TP=4, and aidendle94’s Docker image instead of building from source. Key fix: NCCL_IB_GID_INDEX=0 was needed for RoCE on the CRS812.

tonyd615 · June 4, 2026, 1:51am

So can you run stuff like GLM ? Might need to DM you

Teason2026 · June 4, 2026, 4:47am

there are multiple PR for deepseek v4 on sm12x, @jasl do you know when vLLM plans to add your PR to main branch? I think you said you contacted vLLM team long time back and they planned to first make deepseek v4 works on datacenter gpu, and later comeback to sm12x, and seems like vLLM v0.22.0 largely closed datacenter gpu support.

github.com/vllm-project/vllm

[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes (#41834)

main ← jasl:codex/ds4-sm120-min-enable

opened 03:17PM - 06 May 26 UTC

jasl

+13683 -322

## Latest validation snapshot (2026-06-03) Hardware: 2x NVIDIA RTX PRO 6000 Bla…ckwell Workstation Edition (SM120), plus a reduced 2x GB10 (SM121) long-C2 availability gate. PR head: [`586b7efdb`](https://github.com/jasl/vllm/commit/586b7efdbe6204e267932b98617874445b975455), rebased on upstream/main `59d023619`. Full matrix label: `20260603_decode_isolation_default_user_feedback_matrix/20260603220741`. RTX exact long+long C=2 probe: `20260603_decode_isolation_default_long_long_c2_probe_exact/20260603215100`. GB10 reduced long-C2 label: `20260603_decode_isolation_default_gb10_mtp2_reduced_long_c2/20260603215415`. Primary serve profile: TP=2, MTP=2, FP8 KV cache, block size 256, max model len 131072, `--gpu-memory-utilization 0.975`, `--max-num-batched-tokens 4096`, `--max-num-seqs 4`, expert parallel enabled, prefix cache disabled for cold-latency baselines, and `FULL_AND_PIECEWISE` CUDA graph compilation. The current branch now defers a very-long prefill chunk while decode pressure is already present. This protects active decode cadence and long+long C=2 fairness by serializing the worst overlap case. It is a latency/fairness tradeoff, not a raw single-stream prefill speedup, and it does not justify 256K+/four-card commitments without separate gates. ### Post-push verification | Check | Result | | --- | --- | | Focused scheduler tests | `3 passed` | | Ruff + `git diff --check` | passed | | Full RTX user-feedback matrix | OK, all phases exit `0` | | Prefix-cache stress | filler 100/400/800/1600/3200 all OK, 0 failures | | KV lifecycle | prefix-disabled final idle KV 0.0%; prefix-enabled final idle KV 5.843%, below the 90% stress threshold | | GSM8K limit-200 | flexible EM 0.950, strict EM 0.930 | | GB10 reduced long-C2 gate | OK, no token-cadence failure, no CUDA/NCCL/driver error signals | ### Long-context latency and fairness, RTX PRO 6000 x2 | Shape | Failures | TTFT mean s | Decode tok/s | Decode min/max | ITL p99 s | | --- | ---: | ---: | ---: | ---: | ---: | | 59K C=1 | 0 | 11.649 | 139.639 | 0.989 | 0.021 | | 59K C=2 | 0 | 18.041 | 139.333 | 0.956 | 0.022 | | 124K C=1 | 0 | 29.653 | 106.097 | 0.989 | 0.035 | | 124K C=2 | 0 | 45.620 | 105.511 | 0.934 | 0.031 | | 124K decode-concurrency C=2 | 0 | 45.614 | 104.056 | 0.960 | 0.031 | The exact long+long C=2 probe after the default scheduler change produced decode mean 101.621 tok/s, decode min/max 0.963, p99 ITL 0.0347s, and zero overlap steps in the scheduler trace. The intended tradeoff is visible: the second long prefill waits instead of interfering with an active decode stream. ### Mixed arrival, RTX PRO 6000 x2 | Case | Failures | Primary TTFT s | Secondary TTFT s | Decode min/max | Secondary ITL p99 s | | --- | ---: | ---: | ---: | ---: | ---: | | decode_then_124k | 0 | 30.235 | 30.746 | 0.959 | 0.035 | | decode_then_59k | 0 | 12.178 | 12.708 | 0.956 | 0.022 | | long_then_short | 0 | 31.626 | 3.302 | 0.556 | 0.017 | ### Throughput profile, RTX PRO 6000 x2 | Workload | C=1 | C=2 | C=4 | C=8 | C=16 | C=24 | | --- | ---: | ---: | ---: | ---: | ---: | ---: | | Short MT bench output tok/s | 172.10 | 270.07 | 403.03 | 571.15 | 781.62 | 933.66 | | Random 8K/1K output tok/s | 126.92 | 187.50 | 257.31 | 323.26 | 384.97 | 406.34 | | Random 256/256 output tok/s | 147.12 | 233.33 | 355.04 | 506.79 | 723.25 | 808.45 | ### GB10 status The reduced GB10 long-C2 gate passed after the same default scheduler change: 4 requests, 0 failures, max TTFT 222.868s, p99 ITL 0.089s, max ITL 0.089s, preemptions 0, KV max 35.48%, and no CUDA/NCCL/driver error signals. This is an availability/cadence result, not a GB10 throughput claim. ### Open follow-up gates - 256K+/four-card behavior remains unclaimed until validated on appropriate hardware. - A separate external high-concurrency C=256 sparse-MLA decode workspace-sizing report is tracked by the harness workspace stress gate. It is not treated as fixed by the scheduler change. - Further raw prefill improvements likely need sparse MLA kernel work; the current scheduler change addresses interference/fairness, not the underlying single-stream prefill ceiling. --- ## Historical validation notes ## Purpose Enable DeepSeek V4 Flash on SM12x Blackwell consumer hardware (RTX PRO 6000 Workstation Edition, RTX 5090, DGX Spark GB10). The core challenge: SM12x lacks the TMEM / `tcgen05` instructions present on datacenter Blackwell (SM10x), so DeepGEMM, FlashMLA, and Marlin's FP8 paths fail at kernel link time on this hardware. This PR provides pure-PyTorch fallbacks, Triton kernel implementations, and SM12x-specific tuning so the model runs end-to-end with production-quality perf. ## Validation results Hardware: 2x NVIDIA RTX PRO 6000 Blackwell Workstation Edition. PR head: [`3424fba51`](https://github.com/jasl/vllm/commit/3424fba51301504262c3d8355e2560469f18c9c4). Rebased on `upstream/main` 2026-05-19; NCCL: `nvidia-nccl-cu13` 2.30.4. Reference long-context serve config used for the 2026-05-18 run: ```bash vllm serve deepseek-ai/DeepSeek-V4-Flash \ --kv-cache-dtype fp8 \ --block-size 256 \ --max-model-len 131072 \ --tensor-parallel-size 2 \ --gpu-memory-utilization 0.98 \ --max-num-seqs 4 \ --max-num-batched-tokens 4096 \ --tokenizer-mode deepseek_v4 \ --tool-call-parser deepseek_v4 \ --enable-auto-tool-choice \ --reasoning-parser deepseek_v4 \ --reasoning-config '{"reasoning_parser":"deepseek_v4","reasoning_start_str":"<think>","reasoning_end_str":"</think>"}' \ --speculative_config '{"method":"mtp","num_speculative_tokens":2}' \ --no-enable-prefix-caching \ --no-enable-flashinfer-autotune ``` `--no-enable-prefix-caching` is set for these latency baselines so cold prefill numbers are not biased by cache hits. End-user document/chat deployments should generally keep prefix caching enabled. ### Accuracy `lm_eval` `gsm8k` 5-shot, 200 questions, `temperature=0`, `max_gen_toks=2048`, via `/v1/completions`: | Variant | strict-match | | --- | ---: | | no-MTP | 95.5% | | MTP=2 | 95.0% | Within the historical 0.948-0.965 band on this model. ### Performance (mt-bench, `philschmid/mt-bench`, 80 prompts) | c | no-MTP TPOT (ms) | no-MTP tok/s | no-MTP TTFT med (ms) | MTP=2 TPOT (ms) | MTP=2 tok/s | MTP=2 TTFT med (ms) | | ---: | ---: | ---: | ---: | ---: | ---: | ---: | | 1 | 9.9 | 98 | 53 | **5.7** | **165** | 59 | | 2 | 11.7 | 163 | 72 | 7.5 | 248 | 84 | | 4 | 14.4 | 266 | 77 | 11.8 | 316 | 100 | | 8 | 19.4 | 380 | 86 | 13.7 | 530 | 117 | | 16 | 27.3 | 520 | 120 | 19.3 | 721 | 161 | | 24 | 32.6 | 607 | 391 | **23.2** | **846** | 194 | MTP=2 peak: **165 tok/s single-stream**, **846 tok/s @ c=24**. MTP=2 acceptance length 2.35-2.38 on real-content prompts, pos-0 acceptance 84-85%. ### Long-context prefill Earlier long-context work in this PR added `_accumulate_indexed_attention_chunk_multihead_kernel` (HEAD_BLOCK=8) and overlapped the C128A prefill KV gather with the indexer forward. The latest two commits add a direct SM120 MQA top-k fallback path: Triton materializes the FP8 MQA logits, then the existing custom `top_k_per_row_prefill` op selects top-k row indices without the slower PyTorch per-chunk score path. Dedicated 128K A/B sweep on the same 2x RTX PRO 6000 setup, C=1, cold, `max_tokens=64`: | Build | 127,056-token TTFT mean | Delta vs parent | | --- | ---: | ---: | | Before direct MQA top-k fallback | 60.83 s | - | | Triton MQA logits + PyTorch top-k (`f32b9e782`) | 37.65 s | -38.1% | | Triton MQA logits + custom row top-k (`709f50d10`) | 36.87 s | -2.1% | Conservative full-validation rerun on 2026-05-18, C=1, cold, `max_tokens=64`, repeat=3: | Prompt tokens | C | TTFT mean | TTFT max | Elapsed mean | Failures | | ---: | ---: | ---: | ---: | ---: | ---: | | 63,568 | 1 | 14.71 s | 14.84 s | 15.16 s | 0 / 3 | | 127,056 | 1 | 38.23 s | 38.38 s | 38.84 s | 0 / 3 | That conservative rerun is **37.1% lower TTFT** than the pre-top-k 128K baseline (60.83 s -> 38.23 s), a **1.59x** speedup. Small-concurrency long-context matrix on 2026-05-18, cold salted prompts, `max_tokens=128`, repeat=2, prefix cache disabled: | Prompt tokens | C | Requests | TTFT mean | TTFT max | Elapsed mean | Failures | | ---: | ---: | ---: | ---: | ---: | ---: | ---: | | 63,568 | 1 | 2 | 14.96 s | 15.05 s | 15.45 s | 0 | | 63,568 | 2 | 4 | 23.45 s | 32.18 s | 31.61 s | 0 | | 63,568 | 4 | 8 | 38.96 s | 63.83 s | 55.67 s | 0 | | 127,056 | 1 | 2 | 39.93 s | 41.66 s | 40.59 s | 0 | | 127,056 | 2 | 4 | 58.68 s | 80.52 s | 69.53 s | 0 | | 127,056 | 4 | 8 | 99.10 s | 162.16 s | 122.25 s | 0 | Short-context warmed regression check, 4,047 prompt tokens, cold salted prompts, `max_tokens=64`, repeat=2: | C | TTFT mean | TTFT max | Failures | | ---: | ---: | ---: | ---: | | 1 | 0.643 s | 0.646 s | 0 / 2 | | 2 | 1.019 s | 1.520 s | 0 / 4 | | 4 | 1.724 s | 3.096 s | 0 / 8 | No short-context regression versus the same-machine pre-top-k baseline (4,047 prompt tokens: C=1 0.689 s, C=2 1.125 s, C=4 2.072 s TTFT mean). ### Post-rebase MTP C=4 stability and 64K/128K gate (2026-05-19) This section is historical. An earlier debug iteration tried avoiding the full decode CUDA graph for DeepSeek V4 MTP, but that was removed. The current branch keeps `FULL_AND_PIECEWISE` enabled and fixes the stability path through warmup / dummy-shape handling instead of disabling full decode graph coverage. Short-context MTP matrix, prefix cache disabled, 131K max-model-len, 4096 max-num-batched-tokens, TP=2, 16 prompts: | C | Successful requests | Output tok/s | Mean TTFT | MTP acceptance | | ---: | ---: | ---: | ---: | ---: | | 1 | 16 / 16 | 65.01 | 87.88 ms | 64.16% | | 2 | 16 / 16 | 127.86 | 194.96 ms | 63.39% | | 4 | 16 / 16 | 225.48 | 254.46 ms | 64.85% | Full long-context promotion gate, prefix cache disabled, cold prompts, `max_tokens=128`, repeat=3: | Prompt tokens | C | Requests | TTFT mean | TTFT max | Failures | | ---: | ---: | ---: | ---: | ---: | ---: | | 62,080 | 1 | 3 | 13.009 s | 13.036 s | 0 | | 62,080 | 2 | 6 | 20.370 s | 26.906 s | 0 | | 62,080 | 3 | 9 | 27.672 s | 41.810 s | 0 | | 62,080 | 4 | 12 | 34.554 s | 54.625 s | 0 | | 124,080 | 1 | 3 | 32.779 s | 32.797 s | 0 | | 124,080 | 2 | 6 | 49.830 s | 67.093 s | 0 | | 124,080 | 3 | 9 | 66.912 s | 104.247 s | 0 | | 124,080 | 4 | 12 | 84.197 s | 138.497 s | 0 | GSM8K limit-200, 5-shot, MTP concurrency 1: `exact_match_flexible=0.960`, `exact_match_strict=0.955`. Targeted regression tests for the current graph-preserving path are superseded by the latest validation snapshot above; the current branch does not rely on a full-decode CUDA graph disable workaround. ### Acceptance (toolcall-15 scenario battery) | Variant | score | failures | | --- | ---: | ---: | | no-MTP | 91% | 13 / 135 cases | | MTP=2 | 92% | 12 / 135 cases | This is the first SM12x baseline that evaluates thinking-mode correctly. Two prior harness bugs masked thinking-mode entirely across every earlier retry: 1. The harness was sending `extra_body.thinking={"type":"enabled"}` at the top level, which is the Claude API shape. vLLM's DSv4 chat-template entry reads `chat_template_kwargs.thinking` instead, so every request silently routed to chat mode. Fixed by 323aa1f (confirmed in this PR discussion by qym-ll). 2. The transcript / replay path read `message.reasoning_content`, but this vLLM OpenAI frontend build populates `message.reasoning`. The harness now normalizes both keys. The remaining failures stay concentrated in `TC-06` (Multi-Value Extraction, 7/7 across modes) plus scattered TC-11 / TC-14 / TC-15: characteristic helpfulness-bias / deflect-rather-than-refuse model behaviours, not SM12x regressions. ### Comparison to DeepSeek's official hosted API Same prompts run against `api.deepseek.com/v1/chat/completions` with `model=deepseek-v4-flash`, same `temperature=1.0 top_p=1.0`, and the same thinking-mode shape: | Source | toolcall-15 score | failures / cases | | --- | ---: | ---: | | DeepSeek hosted API | 96% | 2 / 45 (1 round) | | This PR, MTP=2 | 92% | 12 / 135 (3 rounds) | | This PR, no-MTP | 91% | 13 / 135 (3 rounds) | Per-case failure rate: hosted 4.4%, this PR 8.9-9.6%. The hosted service either ships a checkpoint we have not pulled from the HF release, or injects an internal tool-use system prompt. Either way the local vs hosted gap on this PR is the smallest it has been in any baseline shipped here. ### vs 2026-05-12 deployment baseline ([`1c20f1a6d`](https://github.com/jasl/vllm/commit/1c20f1a6d), same hardware) | Metric | 2026-05-12 | 2026-05-17/18 (this PR) | Delta | | --- | ---: | ---: | ---: | | no-MTP mt-bench c=1 tok/s | 89 | 98 | **+10%** | | MTP=2 mt-bench c=1 tok/s | 137 | 165 | **+20%** | | no-MTP mt-bench c=24 tok/s | 557 | 607 | **+9%** | | MTP=2 mt-bench c=24 tok/s | 706 | 846 | **+20%** | | 128K cold C=1 TTFT mean | 60.83 s | 38.23 s | **-37.1%** | ## 2026-06-01 SM120 C=2 fairness update Latest pushed commit: `0440ee5c2` (`Protect very-long prefill fairness`). The earlier 59K/124K long+long C=2 blocker was not fixed by generic chunk-size or kernel launch sweeps. The retained fix is a scheduler-side admission guard: while one very-long prefill is already active, another waiting very-long prefill is deferred for the current scheduler step, but short requests can still be admitted. This keeps `FULL_AND_PIECEWISE` enabled and adds no public tuning knob. Fixed repeat artifact label: `20260601_c2_defer_long_prefill_fixed_repeat/20260601175617`. | Shape | TTFT mean | Decode mean | Decode min/max | ITL p99 | | --- | ---: | ---: | ---: | ---: | | 59K C=1 | 11.788 s | 142.772 tok/s | 0.954 | 0.022 s | | 59K C=2 | 18.642 s | 81.442 tok/s | 0.239 | 0.085 s | | 124K C=1 | 30.160 s | 106.853 tok/s | 0.982 | 0.029 s | | 124K C=2 | 45.972 s | 68.554 tok/s | 0.306 | 0.092 s | | 124K decode-concurrency C=2 | 45.898 s | 68.501 tok/s | 0.309 | 0.092 s | | Mixed long_long_c2 | n/a | 67.797 tok/s | 0.297 | 0.092 s | No-regression follow-up: | Gate | Result | | --- | --- | | Scheduler tests | `108 passed`; ruff passed | | 8K/1K C=1/2/4 PR performance gate | 111.06 / 169.81 / 240.93 tok/s vs accepted 112.44 / 167.86 / 239.99; TPOT within tolerance | | GSM8K 5-shot limit-200 | flexible 0.960, strict 0.950 | | Key user-feedback matrix | primary, prefix-cache stress, and prefix-cache-enabled KV lifecycle all exited 0 | | Streaming pressure | 36 requests, 0 failures, max TTFT 52.624 s, ITL p99 0.717 s | | Prefix-cache stress | filler 100/400/800/1600/3200, all 5-trial stress phases 0 failures | | KV lifecycle | prefix disabled idle KV max 0.0%; prefix enabled final idle KV 5.894%, bounded | | Runtime health | no CUDA/NCCL/driver/engine/runtime error signal; GPUs returned to idle | Scope caveat: these are dual RTX PRO 6000, 128K-class results. GB10 long C=2 high-SM/no-progress and 256K+/4-card behavior still need their own reduced gates before they should be treated as customer commitments. ## Verification commands ```bash ruff check vllm/v1/attention/ops/deepseek_v4_ops/sm12x_deep_gemm_fallbacks.py tests/v1/attention/test_sm120_deepgemm_fallbacks.py python -m py_compile vllm/v1/attention/ops/deepseek_v4_ops/sm12x_deep_gemm_fallbacks.py tests/v1/attention/test_sm120_deepgemm_fallbacks.py python -m pytest tests/v1/attention/test_sm120_deepgemm_fallbacks.py -q ``` Result: `4 passed, 16 warnings`. Long-context matrix verification: ```bash # 64K / 128K, C=1/2/4, cold, prefix cache disabled scripts/run_long_context_latency_matrix.sh # 4K warmed short-context regression, C=1/2/4 scripts/run_long_context_latency_matrix.sh ``` Results: 64K/128K matrix `PASS`, 6 groups, 0 failures; warmed short-context matrix `PASS`, 3 groups, 0 failures. ## Known caveats - **MTP=1 NCCL allgather hang** under sustained multi-stream load was reproduced once in earlier baselines at c=4 mid-bench. This is outside the SM12x fallback patch surface (Torch NCCL `ProcessGroupWatchdog`) and MTP=1 remains smoke-tier pending repro on NCCL 2.30.4+. - **MTP=3 demoted to smoke-tier**: net slightly slower than MTP=2 at every c measured so far. Worth re-checking if upstream MTP draft kernels become cheaper per K. - **Prefix caching disabled** in the reference cold-prefill numbers above. The locally cherry-picked `vllm-project/vllm#42784` fix means prefix cache does work on DSv4 SWA when enabled; a cache-on companion run is still useful for real document-chat deployment. - **Context limit of this validation host**: the current dual RTX PRO 6000 setup can validate up to ~131K model length. 256K, 512K, and 1M scenarios still need larger GPU count / KV budget validation. ## Acknowledgments - @alexbi29 contributed three improvements landed in this revision: - **Multi-head prefill accumulate kernel** (`_accumulate_indexed_attention_chunk_multihead_kernel`, HEAD_BLOCK=8), patterned after the existing decode `_finish_materialized_scores_with_sink_kernel`. - The SWA `_cache_block_mask` over-aggression for Eagle/MTP groups, fixed by `vllm-project/vllm#42784` (cherry-picked locally pending upstream merge). - The `_deepseek_v4_sm12x_fp8_einsum_kernel` autotune key including `num_tokens`, causing per-request 4-config re-benchmarks; we pinned the winning config and removed the decorator. - @aabbccddwasd contributed the **C128A prefill KV gather overlap** with the indexer (`_aux_stream[1]` overlap of `dequantize_and_gather_k_cache` with `indexer.forward`). - @aabbccddwasd's PR-comment suggestions also led to the per-token early-exit on sparse MLA accumulate, the C128A top-k metadata loop cap at `effective_topk`, and the multi-head prefill kernel direction. - @infernix pointed to the fast-prefill autoresearch branch and DeepGEMM SM120 work in [this PR discussion](https://github.com/vllm-project/vllm/pull/41834#issuecomment-4476480477). The latest direct MQA top-k fallback commits were evaluated from that lead and keep the effective path in this PR branch. ## AI assistance disclosure Claude (Anthropic), GPT-5.4, and GPT-5.5 were used for code review, refactoring, regression-script writing, and benchmark analysis. All kernel logic and architectural decisions were validated by human review and end-to-end benchmarks before each push.

corbett_korbett · June 4, 2026, 5:03am

What is everyones working launch command and recipe? I keep having both nodes use 93gb of memory each the second I run the launch command and by the time it says it finished loading the weights I get hit with 130gb memory use by each node then full system lock up to the point I have to hard restart each node. Even changed the max memory use down to 0.5 did not help at all.

0rand · June 4, 2026, 8:34am

I was able to fully replicate a reddit recipe and results are outstanding. I will create a new thread to have it more visible, it worth it

Posted here: DeepSeek v4 Flash (Aiden Recipe from Reddit) - 1M token session operational, Cuda 12.1 tailored for DGX Spark GB10

jasl · June 4, 2026, 8:57am

They probably do not have direct support for SM12x, unfortunately.

corbett_korbett · June 4, 2026, 10:40am

Tried to get it working for 4 hours and when it finally loaded up all responses are incoherent.

cksdnd0106 · June 5, 2026, 2:53am

Which driver version are you using? I’ve tried the exact same recipes, but I always run into OOM (Out of Memory) errors. My current driver version is 580.159.03.

How can I upgrade my drivers, and is there a proper/recommended way to do it?

flykitey · June 5, 2026, 3:24am

thanks for sharing your work! I’m currently trying to deploy DeepSeek v4 on my 2-Spark Ray cluster(GB10) following the official documentation. I’ve set up the Ray cluster and attempted deployment using the deepseekv4-arm64-cu130 vLLM image, but encountered an error: “RuntimeError: call, /opt/venv/lib/python3.12/site-packages/torch/include/torch/csrc/stable/stableivalue_conversions.h:544, Not yet supported ScalarType 44, please file an issue describing your use case.”.

I noticed that most other deployments mentioned in this thread are running with no Ray, so I’m unsure whether the vLLM image and Docker branch referenced here would actually work in a Ray cluster environment. Could you share some details on how you implemented model deployment DeepSeekv4 Flash on Ray? Any insights or configuration tips would be greatly appreciated. Thanks in advance!

Topic		Replies	Views
Deepseek v4 Flash on 2 Nodes DGX Spark / GB10 Projects deepseek	71	5987	June 15, 2026
DeepSeek v4 Flash (Aiden Recipe from Reddit) - 1M token session operational, Cuda 12.1 tailored for DGX Spark GB10 DGX Spark / GB10 deepseek	134	7547	June 21, 2026
DeepSeek-V4-Flash on 4× DGX Spark via vLLM (jasl fork, TP=4, RDMA, MTP) — 49–54 tok/s single-stream, full recipe + the traps DGX Spark / GB10 Projects deepseek	3	230	June 19, 2026
Fully custom CUDA-native Deepseek 4 Flash optimized for 1x Spark! antirez/ds4 DGX Spark / GB10 Projects gaming , llama , deepseek	73	6568	June 20, 2026
Deepseek V4 released DGX Spark / GB10 deepseek	143	16140	May 18, 2026
DeepSeek V4 Flash (1,048,576 Context) on 2x DGX Spark – Custom Sparkrun Recipe DGX Spark / GB10 jetson , deepseek	11	652	June 14, 2026
DeepSeekV4-Flash hybrid quant, 1x DGX Spark: antirez's optimized 128 GB MLX recipe ported to vLLM for GB10 DGX Spark / GB10 Projects deepseek	18	1863	May 11, 2026
DeepSeek V4 Flash: Bringing Frontier AI to the Home DGX Spark / GB10 deepseek	11	2905	May 17, 2026
Anyone having luck with Deepseek V4 Flash on Dual Sparks? DGX Spark / GB10 deepseek	13	1309	June 4, 2026
We unlocked NVFP4 on the DGX Spark: 20% faster than AWQ! DGX Spark / GB10	144	8715	March 14, 2026

DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers

Related topics