DeepSeek-V4-Flash-DSpark on 2× DGX Spark (GB10) — big single-stream speed boost (~60-67 tok/s) + 1M context, now with concurrency

tonyd615 · June 29, 2026, 6:45am

The headline: DSpark (DeepSeek’s speculative-decode build of V4 Flash) gives a real speed boost on 2× DGX Spark at TP=2. On code-gen we’re seeing ~60-67 tok/s, up from the ~40-45 tok/s we got on plain V4 Flash decode. Spec-decode drafts several tokens per step and the target verifies them in one pass, so when draft acceptance is high (predictable content like code) you get a big lift. On mixed/diverse content, where acceptance is lower, it lands around ~40 tok/s.
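To make the acceptance effect concrete, here is a rough sketch of how many tokens one verification pass yields. It assumes every drafted token is accepted independently with the same probability, which real traffic does not strictly follow, and the draft lengths and acceptance rates below are hypothetical, not measured DSpark values.

```python
def expected_tokens_per_step(acceptance: float, draft_len: int) -> float:
    """Expected tokens emitted per target verification pass.

    Simplifying assumptions (not taken from DSpark itself):
      - each drafted token is accepted independently with probability
        `acceptance`;
      - verification stops at the first rejected token;
      - the target always contributes one token of its own per pass.
    Under these assumptions the expectation is the geometric sum
    1 + a + a^2 + ... + a^draft_len.
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance == 1.0:
        return float(draft_len + 1)
    return (1.0 - acceptance ** (draft_len + 1)) / (1.0 - acceptance)


# Hypothetical acceptance rates, for illustration only.
for label, acceptance in [("predictable (code-like)", 0.8), ("mixed", 0.5)]:
    tokens = expected_tokens_per_step(acceptance, draft_len=4)
    print(f"{label}: about {tokens:.2f} tokens per verify pass")
```

This counts tokens per pass, not tok/s. Drafting and verification both cost time, so the real speedup is smaller than this ratio suggests.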

On top of that: NVFP4 KV cache (nvfp4_ds_mla) → full 1M context on a ~2M-token KV pool.

Recipe (Dockerfiles + launch + benchmarks):

Single-stream (TP=2, thinking-off, warmed):

- Code: ~67 tok/s

- Mixed: ~40 tok/s

CONCURRENCY (new): with Keys/drowzeys’ concurrency patch, DSpark now serves multiple streams at once; it used to be single-stream only. At full 1M context with max_num_seqs=6: 6/6 concurrent, ~182 tok/s aggregate. (A 200K / seqs=16 profile pushes ~315 tok/s aggregate static.)

Quick KV-math note (people ask): ~2M pool, 1M context, 6 concurrent is NOT 6×1M reserved. max_model_len is a per-request CEILING, max_num_seqs a concurrency CAP; the pool is SHARED + allocated on demand. Real limit = sum(live tokens) ≤ ~2M. Agent turns never hit 1M, so 6 normal streams share it fine.
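The KV-math note can be sketched as a simple check. This is only an illustration of the three limits described above, not vLLM's scheduler: vLLM allocates KV cache in blocks, which this ignores, and the function name `fits_in_pool` is made up for the example.

```python
POOL_TOKENS = 2_000_000    # ~2M-token shared KV pool (approximate, from the post)
MAX_MODEL_LEN = 1_000_000  # per-request ceiling
MAX_NUM_SEQS = 6           # concurrency cap


def fits_in_pool(live_tokens: list[int]) -> bool:
    """Return True if this set of live streams respects all three limits.

    `live_tokens` holds the current token count of each active stream.
    Hypothetical helper: real vLLM accounting is block-granular.
    """
    if len(live_tokens) > MAX_NUM_SEQS:
        return False  # too many concurrent streams
    if any(tokens > MAX_MODEL_LEN for tokens in live_tokens):
        return False  # one request is past the per-request ceiling
    return sum(live_tokens) <= POOL_TOKENS  # shared pool, used on demand


# Six agent sessions at 150K tokens each use 900K of the pool.
print(fits_in_pool([150_000] * 6))
# Three sessions at a full 1M each would need 3M, more than the pool holds.
print(fits_in_pool([1_000_000] * 3))
```

The point is that nothing is reserved up front: six streams only collide when their live token counts add up past the pool.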

Credits: Keys/drowzeys (concurrency patch), Fraser Price (DSpark model), Rafael Caricio (vLLM integration), MiaAI-Lab (2-node launch), DeepSeek-AI (DeepSpec).

Join my Discord - https://discord.gg/7BbAvza5e

bubexel · June 29, 2026, 8:25am

Starting DSpark worker on 192.168.1.201…
Image vllm-dspark-runtime:dspark-nvfp4-stage-c Pulling
Image vllm-dspark-runtime:dspark-nvfp4-stage-c Error pull access denied for vllm-dspark-runtime, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
Error response from daemon: pull access denied for vllm-dspark-runtime, repository does not exist or may require 'docker login': denied: requested access to the resource is denied

I hit this problem when trying it.

Btw, great job :D

NVM: I had skipped the image build step :D Solved!

0rand · June 29, 2026, 8:26am

YouTube star is keeping us busy! Good stuff. I will try it as well :) Thanks! Guys, all go to YouTube and give a sub to http://www.youtube.com/@Tech2wild1

martinslanina · June 29, 2026, 9:55am

Hi,

I’m trying to run the concurrency profile from the repository.

I have two questions:

For the concurrency profile, should I start from the 1M profile and keep all of its runtime environment variables, changing only these?

MAX_MODEL_LEN=200000
MAX_NUM_SEQS=16
VLLM_USE_B12X_WO_PROJECTION=1
VLLM_DSPARK_GPU_REJECTED_CONTEXT_MASK=1

Or should I comment out/remove the other runtime env variables from the 1M profile and only use the ones required for concurrency?

Would it be possible to add a .env.example for the concurrency profile to the repository? It would help make the required environment variables clearer and easier to set up for new users.

I’m also unable to apply the concurrency patch. Running:

git apply patches/keys-concurrency.patch

results in:

error: patch failed: recipe/overlay/vllm/models/deepseek_v4/nvidia/dspark.py:285
error: recipe/overlay/vllm/models/deepseek_v4/nvidia/dspark.py: patch does not apply
error: patch failed: recipe/overlay/vllm/v1/spec_decode/dspark_proposer.py:76
error: recipe/overlay/vllm/v1/spec_decode/dspark_proposer.py: patch does not apply
error: patch failed: recipe/overlay/vllm/v1/worker/gpu_model_runner.py:5247
error: recipe/overlay/vllm/v1/worker/gpu_model_runner.py: patch does not apply

Is this patch intended for a specific commit or branch? If so, which revision should I be using?

Thanks for your work :)

tonyd615 · June 29, 2026, 1:08pm

You can keep the 1M model len. You can run any concurrency; you just need to think about how many concurrent sessions you will actually be using and whether they will total 2M tokens on a given turn. I set mine to 6 at 1M. 200K at 16 means all 16 sessions would have to sit at a full 200K (~3.2M tokens) at once, which would then cause vLLM to pause sessions until cache is free. Most of the time this won't even happen with agentic work. It really is just a trade-off. I run 1M at 6 seqs because I use 6 agents in this workflow; they all have 1M context and stagger their work.
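To put numbers on that trade-off, here is a small sizing sketch comparing the two profiles against the ~2M pool. The helper names are made up for illustration, and the pool size is the approximate figure quoted in this thread.

```python
POOL_TOKENS = 2_000_000  # ~2M-token KV pool quoted in the thread (approximate)


def worst_case_tokens(max_model_len: int, max_num_seqs: int) -> int:
    """Tokens needed if every allowed stream filled its whole context."""
    return max_model_len * max_num_seqs


def even_share(pool_tokens: int, max_num_seqs: int) -> int:
    """Context each stream gets if all streams are live and equally long."""
    return pool_tokens // max_num_seqs


PROFILES = {"1M x 6": (1_000_000, 6), "200K x 16": (200_000, 16)}

for name, (max_model_len, max_num_seqs) in PROFILES.items():
    worst = worst_case_tokens(max_model_len, max_num_seqs)
    share = even_share(POOL_TOKENS, max_num_seqs)
    print(
        f"{name}: worst case {worst:,} tokens "
        f"({worst / POOL_TOKENS:.1f}x the pool), "
        f"even share {share:,} tokens per stream"
    )
```

Both profiles oversubscribe the pool on paper. What matters in practice is how long the sessions actually get and whether they peak at the same time.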

renek · June 29, 2026, 1:14pm

For everybody who wants to know the difference between DS4 Flash and MiMo V2.5:

Metric	DSpark	MiMo	Winner
Single-stream speed, code (tok/s)	59-71	~38	DSpark +55%
Aggregate speed, C6 (tok/s)	148	96	DSpark +55%
Context	1M	160K	DSpark
Reliability	43/79	56/79	MiMo +13
Tool-eval	88/100	90/100	MiMo +2 (narrow)
Safety	19/26	26/26	MiMo
Hallucination (refusal)	6/9	9/9	MiMo

tonyd615 · June 29, 2026, 1:26pm

MiMo has 1M context as well, bro

renek · June 29, 2026, 1:41pm

I compared speed-oriented settings that don't give you max KV, so I run 160K context as a middle ground, with a 530K KV cache on MiMo. Your 1M context config is slower than mine.

tonyd615 · June 29, 2026, 1:51pm

What speed are you getting at 1M? I might have a fix coming through for that.

renek · June 29, 2026, 1:54pm

At 1M nobody is fast; you're dropping below 10 tok/s. I would need to check my data, but in the short term, as I said, 160K is what's useful for me, and I'm faster there.

co-le · June 29, 2026, 2:02pm

I’ve tried it and it looks good.

I encountered two issues:

- CUDA illegal memory access after a while, crashed
- Some gibberish


Averaged 56 tps on a coding session, so a good speed-up.

This will be great once stabilized!

0rand · June 29, 2026, 2:39pm

DS4F does 18-20 at 1M so it’s tolerable

renek · June 29, 2026, 2:45pm

Then you have good MTP acceptance :)

It won't be as good as that for everybody, but it may be usable as a baseline. I checked for my own use case, and quality ruled out DS4 Flash for me, as did the Omni features. So I guess if you don't need that much, you're way better off with DS4 Flash.

slackyrabbit · June 29, 2026, 2:47pm

Is NVFP4 DSpark fast enough and good enough in quality to justify dropping Aiden FP8 and switching over?

0rand · June 29, 2026, 2:48pm

Check my thread, I shared my recipe there. You need at least 6k of batch chunk per 1 MTP token. Bad acceptance not only ruins speed but can also affect quality. I ended up with an 8k prefill batch and 1 MTP token.
I sometimes need to look at screenshots of charts or websites, so I wired up a separate model with vision; it works fine in both Hermes and Opencode.

renek · June 29, 2026, 2:48pm

You mean the KV cache? I've seen quality drop with FP4 KV cache compared to FP8 before…

renek · June 29, 2026, 2:49pm

Sounds good, but as I said, I never had better results than with MiMo, and Omni comes on top of that. Nevertheless, it's impressive what we already get out of DS4 Flash.

0rand · June 29, 2026, 2:50pm

If only MiMo v2.5 could work on 2x sparks past 32k context…

renek · June 29, 2026, 2:52pm

I have a 530K KV cache and 160K context.

0rand · June 29, 2026, 2:52pm

I have 2M (1 per seq) cache, but it collapses to 3 t/s after 256k and never gets to 1M. Early days.
