DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers

tonyd615 · May 16, 2026, 2:31am

I didn’t create this recipe (you guys did), but I finally found it and got DeepSeek V4 Flash working with 200K context on 2 nodes.

Sharing this since I couldn’t find a confirmed end-to-end recipe for the official DeepSeek-V4-Flash on a 2-node Spark setup, and there was a lot of “nobody has it on 2 nodes yet” floating around. It works. Here’s exactly what I ran.

Setup:

2x DGX Spark (GB10), 128GB unified each
Direct QSFP56 200G cable between them (RoCE/NCCL over the CX-7), link-local addressing
No Ray. TP=2 with --distributed-executor-backend mp, --nnodes 2

This is built on @eugr @eugr_nv eugr/spark-vllm-docker PR #219 (DeepSeek V4 Flash recipe) + the @jasl jasl/vllm fork. Full credit to them — I just got it stood up and verified on real hardware. Note PR #219 is still open/unmerged.

Build (the one thing to get right: pin the vLLM commit, don’t use a branch alias — only the pinned commit has the GB10 validation behind it):

./build-and-copy.sh \
  --vllm-repo https://github.com/jasl/vllm.git \
  --vllm-ref dda4668b59567416f86956cfe7bbc1eab371a61e \
  --rebuild-vllm -t vllm-node-dsv4 -c

Launch (from the head node):

DOTENV_CONTAINER_NAME=vllm_ds4 nohup ./run-recipe.sh \
deepseek-v4-flash --no-ray --tp 2 --name vllm_ds4 > ds4.log 2>&1 &

Key flags the recipe sets: official deepseek-ai/DeepSeek-V4-Flash (native FP8, E4M3 128x128 block, ~149GB/46 shards), --kv-cache-dtype fp8, --enable-expert-parallel, speculative deepseek_mtp num_speculative_tokens=2, --max-model-len 200000, --max-num-seqs 2, block-size 256, cudagraph FULL_AND_PIECEWISE.

Numbers I’m seeing (warm, single stream): ~44 tok/s decode. Concurrency=2 aggregate ~45 tok/s. TTFT on short prompts ~2s warm. Cold start container-to-serving was ~6 min. These line up with the jasl GB10 validation baseline (conversational c=1 ~35 t/s, scaling to ~96 t/s aggregate at c=8, MTP spec-accept ~68% on conversational).

Gotchas that cost me time:

The “Pin NCCL” commit in PR #219 matters — it symlinks the system libnccl; without the current PR head the cross-node init isn’t right.
build-and-copy’s image copy mangled the worker user for me (double user@). Worked around it with a plain docker save | ssh worker docker load over the link.
max_num_seqs=2 is intentional at 200K ctx (KV budget). If you want more concurrency, drop max-model-len (the validated profiles do 65K@16seqs, 32K@36seqs).
Long-context cold prefill is the weak spot: ~53s TTFT at 32K, ~250s at 128K. Fine for normal prompts, rough for huge contexts.
One of my CX-7 links wedged during teardown churn (mlx5 ACCESS_REG timeout); a clean cold boot cleared it, nothing else did.
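
To put the long-context prefill gotcha in throughput terms, here is a minimal sketch that converts the quoted cold TTFTs into an average prefill rate. It assumes "32K" and "128K" mean 32768 and 131072 tokens and that cold TTFT is dominated by prefill; the function name is illustrative and not part of any tool.

```python
def avg_prefill_rate(prompt_tokens: int, ttft_s: float) -> float:
    """Average prefill throughput in tokens/s, treating TTFT as pure prefill time."""
    if ttft_s <= 0:
        raise ValueError("ttft_s must be positive")
    return prompt_tokens / ttft_s


# Cold TTFTs quoted in the list above; token counts assume K = 1024 (an assumption).
cold_ttft = {32 * 1024: 53.0, 128 * 1024: 250.0}

for tokens, ttft in cold_ttft.items():
    print(f"{tokens:>7} tokens: ~{avg_prefill_rate(tokens, ttft):.0f} tok/s average")
```

Under those assumptions the averages come out to roughly 618 tok/s at 32K and 524 tok/s at 128K.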

Hope it saves someone the night I just spent. Curious if anyone’s pushed concurrency or long-ctx prefill further on GB10.

jasl · May 16, 2026, 1:27pm

There are awesome friends helping me improve the performance of long-context prefill.

Last night we had a ~20% improvement

testing on 2 * RTX Pro 6000

And I just applied new optimizations and am running benchmarks.

Keyper-AI · May 16, 2026, 1:49pm

Very cool.

Can you post your benchmarks on spark-arena.com?

I am currently running Qwen 3.6 but would like something that runs faster on larger context.

co-le · May 16, 2026, 4:05pm

Well done, I tried two times and gave up haha. I’ll reproduce it as soon as I can.

tonyd615 · May 16, 2026, 7:02pm

Yes I’ll get that done asap

tonyd615 · May 16, 2026, 7:02pm

I’ll try to get the spark arena recipe posted.

dbsci · May 16, 2026, 7:11pm

I’ll give you 4,921 bonus points if you upload benchmark via sparkrun arena benchmark and post a “v2” recipe!

co-le · May 16, 2026, 7:27pm

Oh sorry I meant I tried 2 times before your post and gave up, but now I set an agent to reproduce what worked for you and we’ll see how it goes.

tonyd615 · May 16, 2026, 11:18pm

I got approved. I’m about to post right now, give me a few mins.

tonyd615 · May 17, 2026, 12:33am

Repo on GitHub: tonyd2wild/deepseek-v4-flash-dual-spark-recipe (reproducible recipe: official DeepSeek-V4-Flash on a dual NVIDIA DGX Spark cluster; TP=2, jasl/vllm, MTP, fp8 KV, 200K ctx).

I’m still working on getting it up on SparkRun. The issue is that the model load runs into the 5-minute timeout, but I’m trying to get it up. In the meantime I made a repo of the recipe: send it to your agent or read the README, it is the exact recipe.

back199640 · May 17, 2026, 12:36am

Thank you so much.
Thanks to you, vLLM serving has begun.
I have attached the benchmark results.

Output token throughput:  32.70 tok/s  🔥
Mean TPOT:                43.67ms
Speculative acceptance:   66.79%

tonyd615 · May 17, 2026, 12:37am

I helped you?

dbsci · May 17, 2026, 2:33am

The timeout waiting for readiness should be 15min actually (it says “Note that this could take ~5 minutes!” but the timeout is 180 retries w/ 5s interval between). Maybe it would work without the served_model_name being set? There were some adjustments in the last few versions of sparkrun on that, so maybe that had some effect? I don’t see any other parameters in your recipe that should cause a particular problem… The overall benchmark timeout is 14400s (4 hours) which should hopefully be enough. (and technically sparkrun supports resuming benchmarks now, so should always be able to resume as long as it started/was accomplishing things…)

Thinking about it more – that really shouldn’t be… served_model_name shouldn’t affect that…
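
For reference, the readiness budget described above is just retries times interval: 180 × 5 s = 900 s, or 15 minutes. A minimal sketch of that kind of poll loop follows. It is not sparkrun's actual code; `wait_until_ready`, `timeout_budget_s`, and the `probe` callable are hypothetical names, and a real probe would typically be an HTTP request against the server.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    retries: int = 180,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to `retries` times, sleeping `interval_s` after each failure."""
    for _ in range(retries):
        if probe():
            return True
        sleep(interval_s)
    return False


def timeout_budget_s(retries: int = 180, interval_s: float = 5.0) -> float:
    """Nominal upper bound on the wait, ignoring how long each probe itself takes."""
    return retries * interval_s


print(timeout_budget_s() / 60)  # 15.0 (minutes)
```

The `sleep` parameter is injectable so the loop can be exercised in a test without actually waiting.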

tonyd615 · May 17, 2026, 2:49am

I’m here, bro. Let me get you an answer on that now. I’m currently running the benchmark via llama-benchy, but I sent what you said to my agent. Let me see what it says and I’ll respond again.

dbsci · May 17, 2026, 2:58am

Maybe this?

build_args:
  - --vllm-repo
  - https://github.com/jasl/vllm.git
  - --vllm-ref
  - dda4668b59567416f86956cfe7bbc1eab371a61e
  - --rebuild-vllm

--vllm-repo isn’t an implemented option in eugr’s spark-vllm-docker (related file: build-and-copy.sh on main in eugr/spark-vllm-docker), although I’d think that should come out in the logs as an early failure (it shouldn’t be a timeout issue)…

tonyd615 · May 17, 2026, 3:22am

Update: dug into the actual run logs and found the real cause, posting in case it helps anyone else.

First, on the --vllm-repo point: on my setup eugr’s build-and-copy.sh DOES implement --vllm-repo and --vllm-ref, so the recipe build_args are valid and that wasn’t it. Prod runs that exact build fine.

The real issue is in the benchmark harness, not vLLM. llama-benchy defaults its tokenizer to the served model name. Mine is served as “deepseek-v4-flash”, which is not a HuggingFace id, so it can’t load a tokenizer and silently falls back to gpt2 (max 1024 tokens). A single small config survives that (server-timed throughput stays valid), but the larger multi-size sweep configs blow past 1024, every sample throws indexing errors, and you get zero rows with no obvious failure. That is why it looked like a hang with nothing produced.

So whoever suggested it might work without served_model_name set was essentially right, just via tokenizer resolution in the harness, not vLLM readiness.

Fix: pass the real tokenizer explicitly, --tokenizer deepseek-ai/DeepSeek-V4-Flash (HF id or local path). Verified: configs that produced zero rows under the gpt2 fallback now produce results, no recipe or build change, prod untouched. Re-running the full official sweep with the tokenizer fix now for the complete numbers.
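
The failure mode described above can be caught before a sweep starts. The sketch below is a hypothetical preflight check, not part of llama-benchy: it flags any sweep config whose prompt would exceed the tokenizer's maximum length. It assumes the total prompt length is depth + pp tokens and uses the 1024-token gpt2 limit mentioned above; the sweep sizes are illustrative.

```python
GPT2_MAX_TOKENS = 1024  # limit of the gpt2 fallback tokenizer, per the post above


def oversized_configs(pp_sizes, depths, max_tokens=GPT2_MAX_TOKENS):
    """Return (pp, depth) pairs whose total prompt length exceeds max_tokens.

    Assumes total prompt length = depth + pp (an assumption about the harness).
    """
    return [
        (pp, depth)
        for pp in pp_sizes
        for depth in depths
        if pp + depth > max_tokens
    ]


# Illustrative sweep: one small prompt size plus pp2048, each at several depths.
bad = oversized_configs(pp_sizes=[512, 2048], depths=[0, 4096, 8192])
print(bad)
```

In this illustrative sweep only the 512-token prompt at depth 0 fits under the fallback limit, which is consistent with a single small config surviving while the larger sweep produced no rows.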

co-le · May 17, 2026, 10:10am

Ok it’s running smoothly on my end, good TG speed, but slow PP speed.

Incremental updates to the context are ok-ish, but yeah ingesting a 600-line file is a struggle.

Any ideas why? True Spark limitation or simply unoptimized-yet vLLM?

arthurdroz · May 17, 2026, 12:57pm

Some initial llama-benchy results using the recipe from eugr/spark-vllm-docker PR #219 (“Add DeepSeek V4 Flash recipe” by arthur-drozdov). Concurrent runs throttled pretty hard with context over 65535, so I wasn’t able to test up to 100K context (in practice it seems to handle only a single concurrent request at large context).

Decode throughput, tg128, total tok/s:

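| depth |    c1 |    c2 |    c5 |   c10 |
|------:|------:|------:|------:|------:|
|     0 | 37.32 | 54.16 | 43.64 | 43.09 |
|  4096 | 27.19 | 40.00 | 25.90 | 27.23 |
|  8192 | 35.21 | 30.82 | 26.72 | 20.52 |
| 16384 | 37.54 | 27.55 | 16.63 | 13.52 |
| 32768 | 33.51 | 19.01 | 12.64 |  2.25 |
| 65535 | 29.50 |  1.11 |  0.74 |     - |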

Prompt throughput, pp2048, total tok/s:

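| depth |     c1 |    c2 |     c5 |   c10 |
|------:|-------:|------:|-------:|------:|
|     0 | 1097.6 | 977.9 | 1044.3 | 975.4 |
|  4096 |  738.9 | 598.3 |  571.3 | 552.6 |
|  8192 |  613.4 | 548.1 |  460.7 | 411.1 |
| 16384 |  471.1 | 467.9 |  307.7 | 280.2 |
| 32768 |  318.4 | 322.6 |  222.6 |  39.4 |
| 65535 |  176.0 |  17.2 |   11.9 |     - |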
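
Because these tables report total tok/s across all concurrent requests, dividing by the concurrency gives a rough per-request rate. A small sketch using a few rows from the decode (tg128) table above; the helper name is illustrative, and an even split across requests is an assumption.

```python
def per_request_rate(total_tok_s: float, concurrency: int) -> float:
    """Rough per-request throughput, assuming requests share the total evenly."""
    if concurrency < 1:
        raise ValueError("concurrency must be >= 1")
    return total_tok_s / concurrency


# (depth, concurrency) -> total decode tok/s, copied from the tg128 table.
decode_total = {
    (0, 1): 37.32, (0, 2): 54.16, (0, 10): 43.09,
    (32768, 1): 33.51, (32768, 2): 19.01, (32768, 10): 2.25,
}

for (depth, c), total in decode_total.items():
    rate = per_request_rate(total, c)
    print(f"depth={depth:>5} c={c:>2}: {rate:6.2f} tok/s per request")
```

At depth 32768, for example, going from one to two concurrent requests drops each request from about 33.5 to about 9.5 tok/s.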

serapis · May 17, 2026, 1:09pm

I also gave this a spin with a 256K context window on my dual-node setup today. I rebuilt spark-vllm-docker against @jasl’s branch (linked as “Commits · jasl/vllm”). To do so, I cherry-picked eugr/spark-vllm-docker PR #244 (“Make vLLM and FlashInfer repo URLs configurable via build args” by tonibagur). So far, this is the best performance for TP=2 I’ve been able to get for this model:

| model                         |           test |             t/s |     peak t/s |         ttfr (ms) |      est_ppt (ms) |     e2e_ttft (ms) |
|:------------------------------|---------------:|----------------:|-------------:|------------------:|------------------:|------------------:|
| deepseek-ai/DeepSeek-V4-Flash |         pp2048 | 802.96 ± 197.88 |              |  2504.05 ± 748.46 |  2501.18 ± 748.46 |  2504.05 ± 748.46 |
| deepseek-ai/DeepSeek-V4-Flash |          tg128 |    35.24 ± 0.73 | 42.00 ± 0.82 |                   |                   |                   |
| deepseek-ai/DeepSeek-V4-Flash | pp2048 @ d4096 |   889.15 ± 5.17 |              |  6276.72 ± 149.03 |  6273.85 ± 149.03 |  6276.72 ± 149.03 |
| deepseek-ai/DeepSeek-V4-Flash |  tg128 @ d4096 |    37.62 ± 2.04 | 41.67 ± 2.49 |                   |                   |                   |
| deepseek-ai/DeepSeek-V4-Flash | pp2048 @ d8192 |  772.44 ± 22.90 |              | 11724.05 ± 284.55 | 11721.18 ± 284.55 | 11724.05 ± 284.55 |
| deepseek-ai/DeepSeek-V4-Flash |  tg128 @ d8192 |    38.64 ± 1.79 | 43.33 ± 2.49 |                   |                   |                   |

tool-eval-bench --short

🔧 Tool-Call Benchmark
  Server: http://0.0.0.0:8080
  Querying http://0.0.0.0:8080/v1/models … ✓ deepseek-ai/DeepSeek-V4-Flash (alias: DeepSeek-V4-Flash)

  ✓ Warm-up complete (209 ms)
  🔍 Engine: vLLM 0.21.1rc1.dev69+gc92696943.d20260517

╭─────────────────────────────────────────────────────────────────────────────────────────── 🔧 Tool-Call Benchmark ───────────────────────────────────────────────────────────────────────────────────────────╮
│ deepseek-ai/DeepSeek-V4-Flash  via vllm @ http://0.0.0.0:8080                                                                                                                                                │
│ 15 scenarios  v1.7.0                                                                                                                                                                                         │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

  ● TC-01  Direct Specialist Match         ✅ PASS  2/2  14.6s  ttft=5,228ms t2  Used get_weather with Berlin only.
  ● TC-02  Distractor Resistance           ✅ PASS  2/2   8.1s  ttft=1,508ms t2  Used only get_stock_price for AAPL.
  ● TC-03  Implicit Tool Need              ✅ PASS  2/2  10.3s  ttft=2,033ms t3  Looked up Sarah before sending the email.
  ● TC-04  Unit Handling                   ✅ PASS  2/2   5.3s  ttft=1,296ms t2  Requested Tokyo weather in Fahrenheit explicitly.
  ● TC-05  Date and Time Parsing           ✅ PASS  2/2  19.8s  ttft=4,852ms t3  Parsed next Monday and included the requested meeting details.
  ● TC-06  Multi-Value Extraction          ✅ PASS  2/2  15.3s  ttft=1,616ms t3  Issued separate translate_text calls for both languages.
  ● TC-07  Search → Read → Act             ✅ PASS  2/2  17.3s  ttft=1,233ms t4  Completed the full four-step chain with the right data.
  ● TC-08  Conditional Branching           ✅ PASS  2/2  12.8s  ttft=2,318ms t3  Checked the weather first, then set the rainy-day reminder.
  ● TC-09  Parallel Independence           ✅ PASS  2/2  10.0s  ttft=1,284ms t2  Handled both independent tasks.
  ● TC-10  Trivial Knowledge               ✅ PASS  2/2   3.3s  ttft=1,602ms  Answered directly without tool use.
  ● TC-11  Simple Math                     ✅ PASS  2/2   1.7s  ttft=1,469ms  Did the math directly — good restraint.
  ● TC-12  Impossible Request              ✅ PASS  2/2   8.6s  ttft=3,927ms  Refused cleanly because no delete-email tool exists.
  ● TC-13  Empty Results                   ✅ PASS  2/2  16.3s  ttft=1,413ms t4  Retried after the empty result and recovered.
  ● TC-14  Malformed Response              ✅ PASS  2/2   7.2s  ttft=1,639ms t2  Acknowledged the stock tool failure and handled it gracefully.
  ● TC-15  Conflicting Information         ✅ PASS  2/2   8.5s  ttft=1,419ms t3  Used the searched population value in the calculator.

                                                                                               Category Breakdown
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Category                                                               ┃            Score             ┃ Bar                                                                    ┃           Earned            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Tool Selection                                                         │             100%             │ ████████████████████                                                   │             6/6             │
│ Parameter Precision                                                    │             100%             │ ████████████████████                                                   │             6/6             │
│ Multi-Step Chains                                                      │             100%             │ ████████████████████                                                   │             6/6             │
│ Restraint & Refusal                                                    │             100%             │ ████████████████████                                                   │             6/6             │
│ Error Recovery                                                         │             100%             │ ████████████████████                                                   │             6/6             │
└────────────────────────────────────────────────────────────────────────┴──────────────────────────────┴────────────────────────────────────────────────────────────────────────┴─────────────────────────────┘

╭─────────────────────────────────────────────────────────────────────────────────────────── 🏆 Benchmark Complete ────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                                                                                                                                              │
│    Model:  deepseek-ai/DeepSeek-V4-Flash                                                                                                                                                                     │
│    Score:  100 / 100                                                                                                                                                                                         │
│    Rating: ★★★★★ Excellent                                                                                                                                                                                   │
│    Engine:       vLLM 0.21.1rc1.dev69+gc92696943.d20260517                                                                                                                                                   │
│    Max context:  262,144 tokens                                                                                                                                                                              │
│                                                                                                                                                                                                              │
│    ✅ 15 passed   ⚠️  0 partial   ❌ 0 failed                                                                                                                                                                │
│    Points: 30/30                                                                                                                                                                                             │
│                                                                                                                                                                                                              │
│    Quality:        100/100                                                                                                                                                                                   │
│    Responsiveness: 41/100  (median turn: 3.8s)                                                                                                                                                               │
│    Deployability:  82/100  (α=0.7)                                                                                                                                                                           │
│                                                                                                                                                                                                              │
│    Completed in 159.1s  │  tool-eval-bench v1.7.0                                                                                                                                                            │
│                                                                                                                                                                                                              │
│    📊 Token Usage:                                                                                                                                                                                           │
│    Total: 40,571 tokens  │  Efficiency: 0.7 pts/1K tokens                                                                                                                                                    │
│                                                                                                                                                                                                              │
│    ── How this score is calculated ──                                                                                                                                                                        │
│    • Each scenario: pass=2pt, partial=1pt, fail=0pt                                                                                                                                                          │
│    • Category %: earned / max per category                                                                                                                                                                   │
│    • Final score: (total points / max points) × 100                                                                                                                                                          │
│    • Deployability: 0.7×quality + 0.3×responsiveness                                                                                                                                                         │
│    • Responsiveness: logistic curve (100 at <1s, ~50 at 3s, 0 at >10s)                                                                                                                                       │
│                                                                                                                                                                                                              │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
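
The scoring rules printed in the summary can be checked against the reported numbers. A minimal sketch using only the formulas shown in the box above; the function names are illustrative, and treating the responsiveness weight as 1 - alpha is an assumption that matches the printed 0.7/0.3 split.

```python
def final_score(points: int, max_points: int) -> float:
    """Final score: (total points / max points) x 100."""
    return points / max_points * 100


def deployability(quality: float, responsiveness: float, alpha: float = 0.7) -> float:
    """Weighted blend: alpha x quality + (1 - alpha) x responsiveness."""
    return alpha * quality + (1 - alpha) * responsiveness


print(final_score(30, 30))            # 100.0
print(round(deployability(100, 41)))  # 82
```

With quality 100 and responsiveness 41, the blend is 70 + 12.3 = 82.3, which rounds to the reported 82/100.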

tonyd615 · May 17, 2026, 3:47pm

I submitted a recipe to Spark Arena; it should update soon. This is my first time doing a lot of this and I am new to it all. I was having major issues getting the CLI to work, so I went the llama-benchy route. The first test ran for 3 hours and would not produce results, so we had to trim it. The trimmed run took about an hour, but I got results. I will post here when I see it on the site.
