DeepSeek-V4-Flash-DSpark on 2× DGX Spark (GB10) — big single-stream speed boost (~60-67 tok/s) + 1M context, now with concurrency

The headline: DSpark (DeepSeek’s speculative-decode build of V4 Flash) gives a real speed boost on 2× DGX Spark at TP=2. On code generation we’re seeing ~60-67 tok/s, up from the ~40-45 tok/s we got with plain V4 Flash decode. Spec-decode drafts several tokens per step and the target model verifies them in one pass, so when draft acceptance is high (predictable content like code) you get a big lift. On mixed/diverse content, where acceptance is lower, we see ~40 tok/s.
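To make the "acceptance drives the speedup" point concrete, here is a minimal sketch using a common simplified model of speculative decoding: each drafted token is accepted independently with the same probability. The names `accept_rate` and `draft_len` are hypothetical, the numbers are illustrative, and this is not DSpark's actual proposer logic or measured acceptance data.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target verification pass.

    Simplified model (an assumption, not DSpark's measured behavior):
    each of the `draft_len` drafted tokens is accepted independently
    with probability `accept_rate`, and the target always contributes
    one token of its own. This is the geometric sum
    1 + a + a^2 + ... + a^draft_len.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be >= 0")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


if __name__ == "__main__":
    # Illustrative acceptance rates only; higher acceptance (code-like
    # content) yields more tokens per target pass than lower acceptance.
    for rate in (0.3, 0.6, 0.9):
        print(rate, round(expected_tokens_per_step(rate, draft_len=3), 2))
```

The model ignores the cost of running the draft, so the real speedup is lower than the tokens-per-step figure suggests.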

On top of that: NVFP4 KV cache (nvfp4_ds_mla) → full 1M context on a ~2M-token KV pool.

Recipe (Dockerfiles + launch + benchmarks):

Single-stream (TP=2, thinking-off, warmed):

- Code: ~67 tok/s

- Mixed: ~40 tok/s

CONCURRENCY (new): with Keys/drowzeys’ concurrency patch, DSpark now serves multiple streams at once; it used to be single-stream only. At full 1M context with max_num_seqs=6: 6/6 concurrent, ~182 tok/s aggregate. (A 200K / seqs=16 profile pushes ~315 tok/s aggregate static.)
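For a rough sense of what those aggregate numbers mean per stream, the sketch below simply divides aggregate throughput by the number of streams. It assumes throughput is split evenly across streams, which the post does not claim; `per_stream_tok_s` is a hypothetical helper.

```python
def per_stream_tok_s(aggregate_tok_s: float, streams: int) -> float:
    """Average per-stream throughput, assuming an even split (an assumption)."""
    if streams <= 0:
        raise ValueError("streams must be positive")
    return aggregate_tok_s / streams


if __name__ == "__main__":
    # Aggregate figures quoted in the post.
    print(round(per_stream_tok_s(182, 6), 1))   # 1M context, max_num_seqs=6
    print(round(per_stream_tok_s(315, 16), 1))  # 200K context, seqs=16
```

So each stream runs slower than the single-stream figure, while total throughput goes up.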

Quick KV-math note (people ask): ~2M pool, 1M context, 6 concurrent is NOT 6×1M reserved. max_model_len is a per-request CEILING, max_num_seqs a concurrency CAP; the pool is SHARED + allocated on demand. Real limit = sum(live tokens) ≤ ~2M. Agent turns never hit 1M, so 6 normal streams share it fine.
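The shared-pool rule can be sketched as a simple check. This is a toy model of the three limits described above, not vLLM's actual scheduler or block allocator; the constants come from the post and `fits` is a hypothetical helper.

```python
POOL_TOKENS = 2_000_000    # ~2M-token shared KV pool (from the post)
MAX_MODEL_LEN = 1_000_000  # per-request ceiling
MAX_NUM_SEQS = 6           # concurrency cap


def fits(live_tokens: list[int]) -> bool:
    """Return True if this set of live requests fits the three limits.

    Toy model only: it counts tokens, whereas a real engine allocates
    KV cache in blocks and has its own scheduling policy.
    """
    if len(live_tokens) > MAX_NUM_SEQS:
        return False  # too many concurrent sequences
    if any(t > MAX_MODEL_LEN for t in live_tokens):
        return False  # a single request exceeds its ceiling
    return sum(live_tokens) <= POOL_TOKENS  # shared pool, on demand


if __name__ == "__main__":
    print(fits([150_000] * 6))    # six typical agent streams
    print(fits([1_000_000] * 2))  # two requests at the full ceiling
    print(fits([1_000_000] * 3))  # three at the ceiling overrun the pool
```

Nothing is reserved up front per request, so only the sum of live tokens matters.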

Credits: Keys/drowzeys (concurrency patch), Fraser Price (DSpark model), Rafael Caricio (vLLM integration), MiaAI-Lab (2-node launch), DeepSeek-AI (DeepSpec).

Join my Discord - https://discord.gg/7BbAvza5e

Starting DSpark worker on 192.168.1.201…
Image vllm-dspark-runtime:dspark-nvfp4-stage-c Pulling
Image vllm-dspark-runtime:dspark-nvfp4-stage-c Error pull access denied for vllm-dspark-runtime, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
Error response from daemon: pull access denied for vllm-dspark-runtime, repository does not exist or may require 'docker login': denied: requested access to the resource is denied

I hit this problem when trying it.

Btw, great job :D

NVM: I skipped the image build step :D Solved!

The YouTube star is keeping us busy! Good stuff. I will try it as well :) Thanks! Guys, all go to YouTube and give a sub to http://www.youtube.com/@Tech2wild1

Hi,

I’m trying to run the concurrency profile from the repository.

I have two questions:

  1. For the concurrency profile, should I start from the 1M profile and keep all of its runtime environment variables, changing only these?
MAX_MODEL_LEN=200000
MAX_NUM_SEQS=16
VLLM_USE_B12X_WO_PROJECTION=1
VLLM_DSPARK_GPU_REJECTED_CONTEXT_MASK=1

Or should I comment out/remove the other runtime env variables from the 1M profile and only use the ones required for concurrency?

Would it be possible to add a .env.example for the concurrency profile to the repository? It would help make the required environment variables clearer and easier to set up for new users.

  2. I’m also unable to apply the concurrency patch. Running:
git apply patches/keys-concurrency.patch

results in:

error: patch failed: recipe/overlay/vllm/models/deepseek_v4/nvidia/dspark.py:285
error: recipe/overlay/vllm/models/deepseek_v4/nvidia/dspark.py: patch does not apply
error: patch failed: recipe/overlay/vllm/v1/spec_decode/dspark_proposer.py:76
error: recipe/overlay/vllm/v1/spec_decode/dspark_proposer.py: patch does not apply
error: patch failed: recipe/overlay/vllm/v1/worker/gpu_model_runner.py:5247
error: recipe/overlay/vllm/v1/worker/gpu_model_runner.py: patch does not apply

Is this patch intended for a specific commit or branch? If so, which revision should I be using?

Thanks for your work :)

You can use the 1M model length. You can run any concurrency you like; you just need to think about how many concurrent sessions you will be using and whether their live tokens will total more than the ~2M pool on a given turn. I set mine to 6 at 1M. 200K at 16 means all 16 sessions would need to sit at a full 200K to reach about 3.2M tokens, which exceeds the pool and would cause vLLM to pause sessions until cache is free. Most of the time this won't even happen with agentic work. It is really a matter of give and take. I run 1M at 6 seqs because I use 6 agents in this workflow; they all have 1M context and stagger their work.
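To compare the two profiles side by side, here is a small sketch of the arithmetic: worst-case demand if every sequence filled its context, how far that oversubscribes the ~2M pool, and the average token budget per sequence if all slots are busy. `profile_summary` is a hypothetical helper and the pool size is the approximate figure from the thread.

```python
POOL_TOKENS = 2_000_000  # approximate shared KV pool size from the thread


def profile_summary(max_model_len: int, max_num_seqs: int,
                    pool: int = POOL_TOKENS) -> dict:
    """Summarize how a (context, concurrency) profile relates to the pool."""
    worst_case = max_model_len * max_num_seqs
    return {
        # Tokens needed if every sequence filled its full context.
        "worst_case_tokens": worst_case,
        # Values above 1.0 mean the worst case cannot fit at once.
        "oversubscription": worst_case / pool,
        # Average budget per sequence when all slots are in use.
        "avg_budget_per_seq": pool // max_num_seqs,
    }


if __name__ == "__main__":
    for name, (length, seqs) in {
        "1M x 6": (1_000_000, 6),
        "200K x 16": (200_000, 16),
    }.items():
        print(name, profile_summary(length, seqs))
```

Both profiles oversubscribe the pool in the worst case; the difference is how long any single session can grow before the others have to wait.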

For everybody who wants to know the difference between DS4 Flash and MiMo V2.5:

| Metric | DSpark | MiMo | Winner |
|---|---|---|---|
| Single-stream speed, code (tok/s) | 59-71 | ~38 | DSpark +55% |
| Aggregate speed, C6 (tok/s) | 148 | 96 | DSpark +55% |
| Context | 1M | 160K | DSpark |
| Reliability | 43/79 | 56/79 | MiMo +13 |
| Tool-eval | 88/100 | 90/100 | MiMo +2 (narrow) |
| Safety | 19/26 | 26/26 | MiMo |
| Hallucination (refusal) | 6/9 | 9/9 | MiMo |

Mimo has 1M context as well bro

I compared speed-oriented settings that don't give you max KV, so I run 160K context as a middle ground with a 530K KV cache on MiMo. Your 1M context config is slower than mine.

What speed are you getting at 1M? I might have a fix coming through for that.

At 1M nobody is fast; you're dropping below 10 tok/s. I would need to check my data, but in the short term, as I said, 160K is what's useful for me, and I'm faster there.

I’ve tried it and it looks good.

I encountered two issues:

  1. CUDA illegal memory access after a while, crashed
  2. Some gibberish

Averaged 56 tps on a coding session, so a good speed-up.

This will be great once stabilized!

DS4F does 18-20 tok/s at 1M, so it’s tolerable.

then you have good MTP acceptance :)

It won't be that good for everybody, but it may be usable as a baseline. I checked for my use case, and quality ruled out DS4 Flash, as did the Omni features. I guess if you don't need that much, you’re way better off with DS4 Flash.

Is NVFP4 DSpark fast enough and good enough in quality to justify dropping Aiden FP8 and switching over?

Check my thread; I shared my recipe there. You need at least 6k of batch chunk size per MTP token. Bad acceptance not only ruins speed but can also affect quality. I ended up with an 8k prefill batch and 1 MTP token.
I sometimes need to look at screenshots of charts or websites, so I wired in a separate model with vision; it works fine in both Hermes and Opencode.

You mean the KV cache? I've seen a quality drop with an FP4 KV cache compared to FP8 before…

Sounds good, but as I said, I never got better results than with MiMo, and Omni comes on top of that. Nevertheless, it's impressive what we already get out of DS4 Flash.

If only MiMo v2.5 could work on 2x sparks past 32k context…

I have a 530K KV cache and 160K context.

I have a 2M (1 per seq) cache, but it collapses to 3 t/s after 256K and never gets to 1M. Early days.