The headline: DSpark (DeepSeek’s speculative-decode build of V4 Flash) gives a real speed boost on 2× DGX Spark at TP=2. On code-gen we’re seeing ~60-67 tok/s, up from the ~40-45 tok/s we got on plain V4 Flash decode. Spec-decode drafts several tokens per step and the target verifies them in one pass, so when draft acceptance is high (predictable content like code) you get a big lift. On mixed/diverse content, where acceptance is lower, it lands at ~40 tok/s.
On top of that: NVFP4 KV cache (nvfp4_ds_mla) → full 1M context on a ~2M-token KV pool.
Recipe (Dockerfiles + launch + benchmarks):
Single-stream (TP=2, thinking-off, warmed):
- Code: ~67 tok/s
- Mixed: ~40 tok/s
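For anyone wondering why code and mixed content differ so much, here is a rough sketch of the textbook speculative-decoding estimate. It assumes each draft token is accepted independently with the same probability and that the target always contributes one token per pass. The acceptance rates and draft length below are made-up illustrations, not DSpark measurements, and the function name is hypothetical.

```python
def expected_tokens_per_step(accept_rate: float, num_draft: int) -> float:
    """Expected tokens emitted per target verification pass.

    Assumption (not from the thread): each of the num_draft draft tokens is
    accepted independently with probability accept_rate, verification stops
    at the first rejection, and the target always adds one token of its own.
    That gives the geometric sum 1 + a + a^2 + ... + a^k.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if num_draft < 0:
        raise ValueError("num_draft must be non-negative")
    if accept_rate == 1.0:
        # Every draft accepted, plus the target's own token.
        return float(num_draft + 1)
    return (1.0 - accept_rate ** (num_draft + 1)) / (1.0 - accept_rate)


if __name__ == "__main__":
    # Illustrative values only: high acceptance (code-like) vs. lower (mixed).
    for rate in (0.8, 0.5):
        print(rate, round(expected_tokens_per_step(rate, num_draft=3), 3))
```

This counts tokens per verification pass only. It ignores the cost of running the draft, so it is an upper bound on the wall-clock lift, not a prediction of tok/s.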
CONCURRENCY (new): with Keys/drowzeys’ concurrency patch, DSpark now serves multiple streams at once; it used to be single-stream only. At full 1M context with max_num_seqs=6: 6/6 concurrent, ~182 tok/s aggregate. (A 200K / seqs=16 profile pushes ~315 tok/s aggregate static.)
Quick KV-math note (people ask): ~2M pool, 1M context, 6 concurrent is NOT 6×1M reserved. max_model_len is a per-request CEILING, max_num_seqs a concurrency CAP; the pool is SHARED + allocated on demand. Real limit = sum(live tokens) ≤ ~2M. Agent turns never hit 1M, so 6 normal streams share it fine.
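The same accounting as a toy Python check, in case it helps. This is a simplified model of the three limits described above, not vLLM's actual scheduler: it ignores block granularity and any overhead, and the class and function names are hypothetical. The pool size is the approximate ~2M figure from this post.

```python
from dataclasses import dataclass

POOL_TOKENS = 2_000_000  # approximate shared KV pool size quoted in the post


@dataclass
class Profile:
    max_model_len: int  # per-request ceiling
    max_num_seqs: int   # concurrency cap


def worst_case_tokens(profile: Profile) -> int:
    """Tokens needed if every slot were filled to the per-request ceiling."""
    return profile.max_model_len * profile.max_num_seqs


def fits(live_tokens: list[int], profile: Profile,
         pool_tokens: int = POOL_TOKENS) -> bool:
    """True if the live requests respect the cap, the ceiling, and the pool."""
    if len(live_tokens) > profile.max_num_seqs:
        return False  # more streams than the concurrency cap allows
    if any(t > profile.max_model_len for t in live_tokens):
        return False  # a single request exceeds the per-request ceiling
    return sum(live_tokens) <= pool_tokens  # shared pool, allocated on demand


if __name__ == "__main__":
    one_m = Profile(max_model_len=1_000_000, max_num_seqs=6)
    print(worst_case_tokens(one_m))        # 6,000,000: never actually reserved
    print(fits([150_000] * 6, one_m))      # six typical agent turns
    print(fits([800_000] * 3, one_m))      # three very long sessions at once
```

The worst-case product is much larger than the pool, but it only matters if every stream actually fills its context at the same moment.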
Starting DSpark worker on 192.168.1.201…
Image vllm-dspark-runtime:dspark-nvfp4-stage-c Pulling
Image vllm-dspark-runtime:dspark-nvfp4-stage-c Error pull access denied for vllm-dspark-runtime, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
Error response from daemon: pull access denied for vllm-dspark-runtime, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
YouTube star is keeping us busy! Good stuff. I will try it as well :) Thanks! Guys, all go to YouTube and give a sub to http://www.youtube.com/@Tech2wild1
Or should I comment out/remove the other runtime env variables from the 1M profile and only use the ones required for concurrency?
Would it be possible to add a .env.example for the concurrency profile to the repository? It would help make the required environment variables clearer and easier to set up for new users.
I’m also unable to apply the concurrency patch. Running:
git apply patches/keys-concurrency.patch
results in:
error: patch failed: recipe/overlay/vllm/models/deepseek_v4/nvidia/dspark.py:285
error: recipe/overlay/vllm/models/deepseek_v4/nvidia/dspark.py: patch does not apply
error: patch failed: recipe/overlay/vllm/v1/spec_decode/dspark_proposer.py:76
error: recipe/overlay/vllm/v1/spec_decode/dspark_proposer.py: patch does not apply
error: patch failed: recipe/overlay/vllm/v1/worker/gpu_model_runner.py:5247
error: recipe/overlay/vllm/v1/worker/gpu_model_runner.py: patch does not apply
Is this patch intended for a specific commit or branch? If so, which revision should I be using?
You can do the 1M model len. You can run any concurrency; you just need to think about how many concurrent sessions you are going to use and whether they will total 2M on a given turn. I set mine to 6 at 1M. With 200K at 16, the worst case is 16 sessions at a full 200K each, about 3.2M tokens, which is over the pool and would cause vLLM to pause sessions until cache is free. Most of the time this won't even happen with agentic work. It is really just give and take. I run 1M at 6 seqs because I use 6 agents in this workflow; they all have 1M context and stagger their work.
I compared speed settings that don't give you max KV, so I run 160K context as a middle ground, with a 530K KV cache on Mimo. Your 1M context config is slower than mine.
It will not be as good as that for everybody, but it may be usable as a baseline. I checked for my use case, and quality killed DS4 Flash, as well as omni features. I guess if you don't need that much, you're way better off with DS4 Flash.
Check my thread - I shared my recipe. You need at least 6K of batch chunk per MTP token. Bad acceptance not only ruins speed but can also affect quality. I ended up with an 8K prefill batch and 1 MTP token.
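That rule of thumb as quick arithmetic, read as "about 6K prefill batch tokens per MTP token". This is only one reading of the rule, the function and parameter names are hypothetical, and treating 6K and 8K as multiples of 1024 is an assumption.

```python
TOKENS_PER_MTP = 6 * 1024  # assumed reading of "6K per MTP token"


def max_mtp_tokens(prefill_batch_tokens: int,
                   tokens_per_mtp: int = TOKENS_PER_MTP) -> int:
    """How many MTP (draft) tokens the rule of thumb would allow."""
    if prefill_batch_tokens < 0 or tokens_per_mtp <= 0:
        raise ValueError("sizes must be positive")
    return prefill_batch_tokens // tokens_per_mtp


if __name__ == "__main__":
    # An 8K prefill batch leaves room for a single MTP token under this rule.
    print(max_mtp_tokens(8 * 1024))
```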
I sometimes need to look at screenshots of charts or websites - I wired in a separate model with vision, and it works fine in both Hermes and Opencode.