DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers

I didn’t create this recipe, you guys did, but I was finally able to find it and get DeepSeek-V4-Flash working with 200K context on 2 nodes.

Sharing this since I couldn’t find a confirmed end-to-end recipe for the official DeepSeek-V4-Flash on a 2-node Spark setup, and there was a lot of “nobody has it on 2 nodes yet” floating around. It works. Here’s exactly what I ran.

Setup:

  • 2x DGX Spark (GB10), 128GB unified each
  • Direct QSFP56 200G cable between them (RoCE/NCCL over the CX-7), link-local addressing
  • No Ray. TP=2 with --distributed-executor-backend mp, --nnodes 2

This is built on @eugr @eugr_nv eugr/spark-vllm-docker PR #219 (DeepSeek V4 Flash recipe) + the @jasl jasl/vllm fork. Full credit to them — I just got it stood up and verified on real hardware. Note PR #219 is still open/unmerged.

Build (the one thing to get right: pin the vLLM commit, don’t use a branch alias — only the pinned commit has the GB10 validation behind it):

./build-and-copy.sh \
  --vllm-repo https://github.com/jasl/vllm.git \
  --vllm-ref dda4668b59567416f86956cfe7bbc1eab371a61e \
  --rebuild-vllm -t vllm-node-dsv4 -c

Launch (from the head node):

DOTENV_CONTAINER_NAME=vllm_ds4 nohup ./run-recipe.sh \
deepseek-v4-flash --no-ray --tp 2 --name vllm_ds4 > ds4.log 2>&1 &

Key flags the recipe sets: official deepseek-ai/DeepSeek-V4-Flash (native FP8, E4M3 128x128 block, ~149GB/46 shards), --kv-cache-dtype fp8, --enable-expert-parallel, speculative deepseek_mtp num_speculative_tokens=2, --max-model-len 200000, --max-num-seqs 2, block-size 256, cudagraph FULL_AND_PIECEWISE.

Numbers I’m seeing (warm, single stream): ~44 tok/s decode. Concurrency=2 aggregate ~45 tok/s. TTFT on short prompts ~2s warm. Cold start container-to-serving was ~6 min. These line up with the jasl GB10 validation baseline (conversational c=1 ~35 t/s, scaling to ~96 t/s aggregate at c=8, MTP spec-accept ~68% on conversational).

Gotchas that cost me time:

  • The “Pin NCCL” commit in PR #219 matters — it symlinks the system libnccl; without the current PR head the cross-node init isn’t right.
  • build-and-copy’s image copy mangled the worker user for me (double user@). Worked around it with a plain docker save | ssh worker docker load over the link.
  • max_num_seqs=2 is intentional at 200K ctx (KV budget). If you want more concurrency, drop max-model-len (the validated profiles do 65K@16seqs, 32K@36seqs).
  • Long-context cold prefill is the weak spot: ~53s TTFT at 32K, ~250s at 128K. Fine for normal prompts, rough for huge contexts.
  • One of my CX-7 links wedged during teardown churn (mlx5 ACCESS_REG timeout); a clean cold boot cleared it, nothing else did.
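The long-context prefill figures in the list above (~53s TTFT at 32K, ~250s at 128K) can be turned into a rough effective prefill rate. This is a minimal sketch, assuming 32K and 128K mean 32,768 and 131,072 tokens and that cold TTFT is dominated by prefill; `effective_prefill_rate` is a name made up for illustration.

```python
def effective_prefill_rate(prompt_tokens: int, ttft_seconds: float) -> float:
    """Tokens per second implied by a cold time-to-first-token measurement."""
    if ttft_seconds <= 0:
        raise ValueError("ttft_seconds must be positive")
    return prompt_tokens / ttft_seconds


# TTFT figures quoted above; token counts assume 32K = 32768 and 128K = 131072.
observations = {32_768: 53.0, 131_072: 250.0}

for tokens, ttft in observations.items():
    rate = effective_prefill_rate(tokens, ttft)
    print(f"{tokens:>7} tokens: ~{rate:.0f} tok/s effective prefill")
```

TTFT also covers scheduling and the first decode step, so these rates are best read as a lower bound on raw prefill speed.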

Hope it saves someone the night I just spent. Curious if anyone’s pushed concurrency or long-ctx prefill further on GB10.

There are awesome friends helping me improve the performance of long-context prefill.

Last night we had a ~20% improvement, testing on 2x RTX Pro 6000.

And I just applied new optimizations and am running benchmarks.

Very cool.

Can you post your benchmarks on spark-arena.com?

I am currently running Qwen 3.6 but would like something that runs faster on larger context.

Well done, I tried two times and gave up haha. I’ll reproduce it as soon as I can.

Yes I’ll get that done asap

I’ll try to get the spark arena recipe posted.

I’ll give you 4,921 bonus points if you upload benchmark via sparkrun arena benchmark and post a “v2” recipe!

Oh sorry I meant I tried 2 times before your post and gave up, but now I set an agent to reproduce what worked for you and we’ll see how it goes.

I got approved. I’m about to post right now, give me a few mins.

Repo on GitHub: tonyd2wild/deepseek-v4-flash-dual-spark-recipe (reproducible recipe: official DeepSeek-V4-Flash on a dual NVIDIA DGX Spark cluster; TP=2, jasl/vllm, MTP, fp8 KV, 200K ctx).

I’m still working on getting it up on SparkRun. The issue I’m having is that the model doesn’t finish loading within the 5-minute timeout, but I’m trying to get it up. In the meantime I made a repo of the recipe: send it to your agent or read the README, this is the exact recipe.


Thank you so much.
Thanks to you, vLLM serving has begun.
I have attached the benchmark results.

Output token throughput:  32.70 tok/s  🔥
Mean TPOT:                43.67ms
Speculative acceptance:   66.79%

I helped you?

The timeout waiting for readiness should be 15min actually (it says “Note that this could take ~5 minutes!” but the timeout is 180 retries w/ 5s interval between). Maybe it would work without the served_model_name being set? There were some adjustments in the last few versions of sparkrun on that, so maybe that had some effect? I don’t see any other parameters in your recipe that should cause a particular problem… The overall benchmark timeout is 14400s (4 hours) which should hopefully be enough. (and technically sparkrun supports resuming benchmarks now, so should always be able to resume as long as it started/was accomplishing things…)
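For reference, here is the arithmetic behind that 15-minute figure, along with a minimal sketch of a retry-style readiness wait. This is not sparkrun’s actual code: `wait_until_ready`, `probe`, and the injectable `sleep` are hypothetical names, and only the 180 retries and 5s interval come from the post above.

```python
import time
from typing import Callable


def max_wait_seconds(retries: int, interval_s: float) -> float:
    """Upper bound on time spent sleeping between readiness probes."""
    return retries * interval_s


def wait_until_ready(
    probe: Callable[[], bool],
    retries: int = 180,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to `retries` times, sleeping after each failed attempt.

    Returns True as soon as probe() succeeds, False if every attempt fails.
    """
    for _ in range(retries):
        if probe():
            return True
        sleep(interval_s)
    return False


print(max_wait_seconds(180, 5.0) / 60, "minutes")
```

In practice `probe` would be something like an HTTP check against the server; it is left abstract here so the loop can be exercised without a running model.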

Thinking about it more – that really shouldn’t be… served_model_name shouldn’t affect that…

I’m here bro, let me get you an answer on that now. What I’m currently doing is running the benchmark via llama-benchy, but I sent what you said to my agent. Let me see what they say back and I’ll respond again.

Maybe this?

build_args:
  - --vllm-repo
  - https://github.com/jasl/vllm.git
  - --vllm-ref
  - dda4668b59567416f86956cfe7bbc1eab371a61e
  - --rebuild-vllm

--vllm-repo isn’t an implemented option in eugr’s spark-vllm-docker (related file: build-and-copy.sh on main in eugr/spark-vllm-docker); although that should come out in the logs as an early failure I’d think… (shouldn’t be a timeout issue…)

Update: dug into the actual run logs and found the real cause, posting in case it helps anyone else.

First, on the --vllm-repo point: on my setup eugr’s build-and-copy.sh DOES implement --vllm-repo and --vllm-ref, so the recipe build_args are valid and that wasn’t it. Prod runs that exact build fine.

The real issue is in the benchmark harness, not vLLM. llama-benchy defaults its tokenizer to the served model name. Mine is served as “deepseek-v4-flash”, which is not a HuggingFace id, so it can’t load a tokenizer and silently falls back to gpt2 (max 1024 tokens). A single small config survives that (server-timed throughput stays valid), but the larger multi-size sweep configs blow past 1024, every sample throws indexing errors, and you get zero rows with no obvious failure. That is why it looked like a hang with nothing produced.

So whoever suggested it might work without served_model_name set was essentially right, just via tokenizer resolution in the harness, not vLLM readiness.

Fix: pass the real tokenizer explicitly, --tokenizer deepseek-ai/DeepSeek-V4-Flash (HF id or local path). Verified: configs that produced zero rows under the gpt2 fallback now produce results, no recipe or build change, prod untouched. Re-running the full official sweep with the tokenizer fix now for the complete numbers.
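To make the failure mode concrete, here is a minimal sketch of a pre-flight check that flags which sweep sizes would exceed a fallback tokenizer’s limit. It does not touch llama-benchy’s real internals; `configs_at_risk` and the example sweep sizes are hypothetical, and the 1024-token limit is the gpt2 figure quoted above.

```python
from typing import Iterable, List

GPT2_FALLBACK_LIMIT = 1024  # max tokens for the gpt2 fallback, per the post above


def configs_at_risk(
    prompt_sizes: Iterable[int], tokenizer_limit: int = GPT2_FALLBACK_LIMIT
) -> List[int]:
    """Return the prompt sizes that exceed the tokenizer's maximum length."""
    return [size for size in prompt_sizes if size > tokenizer_limit]


# Hypothetical sweep sizes, for illustration only.
sweep = [512, 2048, 8192]
risky = configs_at_risk(sweep)
if risky:
    print(
        f"Pass an explicit --tokenizer: sizes {risky} "
        f"exceed {GPT2_FALLBACK_LIMIT} tokens"
    )
```

A check like this would have turned the silent zero-row sweep into an obvious warning before the run started.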

Ok it’s running smoothly on my end, good TG speed, but slow PP speed.

Incremental updates to the context are ok-ish, but yeah ingesting a 600-line file is a struggle.

Any ideas why? True Spark limitation or simply unoptimized-yet vLLM?

Some initial llama-benchy results using the recipe from eugr/spark-vllm-docker PR #219 (“Add DeepSeek V4 Flash recipe” by arthur-drozdov). Concurrent runs throttled pretty hard with context over 65535, so I wasn’t able to test up to 100K context (in practice it seems to handle only a single concurrent request at large context).

Decode throughput, tg128, total tok/s:

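| depth |    c1 |    c2 |    c5 |   c10 |
|------:|------:|------:|------:|------:|
|     0 | 37.32 | 54.16 | 43.64 | 43.09 |
|  4096 | 27.19 | 40.00 | 25.90 | 27.23 |
|  8192 | 35.21 | 30.82 | 26.72 | 20.52 |
| 16384 | 37.54 | 27.55 | 16.63 | 13.52 |
| 32768 | 33.51 | 19.01 | 12.64 |  2.25 |
| 65535 | 29.50 |  1.11 |  0.74 |     - |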

Prompt throughput, pp2048, total tok/s:

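| depth |     c1 |    c2 |     c5 |   c10 |
|------:|-------:|------:|-------:|------:|
|     0 | 1097.6 | 977.9 | 1044.3 | 975.4 |
|  4096 |  738.9 | 598.3 |  571.3 | 552.6 |
|  8192 |  613.4 | 548.1 |  460.7 | 411.1 |
| 16384 |  471.1 | 467.9 |  307.7 | 280.2 |
| 32768 |  318.4 | 322.6 |  222.6 |  39.4 |
| 65535 |  176.0 |  17.2 |   11.9 |     - |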
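
One way to read the numbers above is to compare each aggregate figure against the single-stream (c1) value at the same depth. A minimal sketch using the decode (tg128) rows, with values copied from the results above; `scaling_vs_single` is a hypothetical helper.

```python
from typing import Dict, List, Optional

# Decode throughput (tg128, total tok/s) by depth; columns are c1, c2, c5, c10.
# None marks the run that produced no result.
DECODE: Dict[int, List[Optional[float]]] = {
    0: [37.32, 54.16, 43.64, 43.09],
    4096: [27.19, 40.00, 25.90, 27.23],
    8192: [35.21, 30.82, 26.72, 20.52],
    16384: [37.54, 27.55, 16.63, 13.52],
    32768: [33.51, 19.01, 12.64, 2.25],
    65535: [29.50, 1.11, 0.74, None],
}


def scaling_vs_single(row: List[Optional[float]]) -> List[Optional[float]]:
    """Ratio of each aggregate throughput to the c1 value in the same row."""
    base = row[0]
    return [None if value is None else value / base for value in row]


for depth, row in DECODE.items():
    ratios = scaling_vs_single(row)
    cells = ["   -" if r is None else f"{r:4.2f}" for r in ratios]
    print(f"{depth:>6}: " + "  ".join(cells))
```

A ratio below 1.0 means adding concurrency lowered total throughput. In these numbers that holds for every concurrency level above 1 from depth 8192 onward.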

I also gave this a spin with a 256K context window on my dual-node setup today. I rebuilt spark-vllm-docker with this branch by @jasl (Commits · jasl/vllm on GitHub). To do so, I cherry-picked eugr/spark-vllm-docker PR #244 (“Make vLLM and FlashInfer repo URLs configurable via build args” by tonibagur). So far, this is the best performance for TP=2 I’ve been able to get for this model:

| model                         |           test |             t/s |     peak t/s |         ttfr (ms) |      est_ppt (ms) |     e2e_ttft (ms) |
|:------------------------------|---------------:|----------------:|-------------:|------------------:|------------------:|------------------:|
| deepseek-ai/DeepSeek-V4-Flash |         pp2048 | 802.96 ± 197.88 |              |  2504.05 ± 748.46 |  2501.18 ± 748.46 |  2504.05 ± 748.46 |
| deepseek-ai/DeepSeek-V4-Flash |          tg128 |    35.24 ± 0.73 | 42.00 ± 0.82 |                   |                   |                   |
| deepseek-ai/DeepSeek-V4-Flash | pp2048 @ d4096 |   889.15 ± 5.17 |              |  6276.72 ± 149.03 |  6273.85 ± 149.03 |  6276.72 ± 149.03 |
| deepseek-ai/DeepSeek-V4-Flash |  tg128 @ d4096 |    37.62 ± 2.04 | 41.67 ± 2.49 |                   |                   |                   |
| deepseek-ai/DeepSeek-V4-Flash | pp2048 @ d8192 |  772.44 ± 22.90 |              | 11724.05 ± 284.55 | 11721.18 ± 284.55 | 11724.05 ± 284.55 |
| deepseek-ai/DeepSeek-V4-Flash |  tg128 @ d8192 |    38.64 ± 1.79 | 43.33 ± 2.49 |                   |                   |                   |

tool-eval-bench --short

🔧 Tool-Call Benchmark
  Server: http://0.0.0.0:8080
  Querying http://0.0.0.0:8080/v1/models … ✓ deepseek-ai/DeepSeek-V4-Flash (alias: DeepSeek-V4-Flash)

  ✓ Warm-up complete (209 ms)
  🔍 Engine: vLLM 0.21.1rc1.dev69+gc92696943.d20260517

╭─────────────────────────────────────────────────────────────────────────────────────────── 🔧 Tool-Call Benchmark ───────────────────────────────────────────────────────────────────────────────────────────╮
│ deepseek-ai/DeepSeek-V4-Flash  via vllm @ http://0.0.0.0:8080                                                                                                                                                │
│ 15 scenarios  v1.7.0                                                                                                                                                                                         │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

  ● TC-01  Direct Specialist Match         ✅ PASS  2/2  14.6s  ttft=5,228ms t2  Used get_weather with Berlin only.
  ● TC-02  Distractor Resistance           ✅ PASS  2/2   8.1s  ttft=1,508ms t2  Used only get_stock_price for AAPL.
  ● TC-03  Implicit Tool Need              ✅ PASS  2/2  10.3s  ttft=2,033ms t3  Looked up Sarah before sending the email.
  ● TC-04  Unit Handling                   ✅ PASS  2/2   5.3s  ttft=1,296ms t2  Requested Tokyo weather in Fahrenheit explicitly.
  ● TC-05  Date and Time Parsing           ✅ PASS  2/2  19.8s  ttft=4,852ms t3  Parsed next Monday and included the requested meeting details.
  ● TC-06  Multi-Value Extraction          ✅ PASS  2/2  15.3s  ttft=1,616ms t3  Issued separate translate_text calls for both languages.
  ● TC-07  Search → Read → Act             ✅ PASS  2/2  17.3s  ttft=1,233ms t4  Completed the full four-step chain with the right data.
  ● TC-08  Conditional Branching           ✅ PASS  2/2  12.8s  ttft=2,318ms t3  Checked the weather first, then set the rainy-day reminder.
  ● TC-09  Parallel Independence           ✅ PASS  2/2  10.0s  ttft=1,284ms t2  Handled both independent tasks.
  ● TC-10  Trivial Knowledge               ✅ PASS  2/2   3.3s  ttft=1,602ms  Answered directly without tool use.
  ● TC-11  Simple Math                     ✅ PASS  2/2   1.7s  ttft=1,469ms  Did the math directly — good restraint.
  ● TC-12  Impossible Request              ✅ PASS  2/2   8.6s  ttft=3,927ms  Refused cleanly because no delete-email tool exists.
  ● TC-13  Empty Results                   ✅ PASS  2/2  16.3s  ttft=1,413ms t4  Retried after the empty result and recovered.
  ● TC-14  Malformed Response              ✅ PASS  2/2   7.2s  ttft=1,639ms t2  Acknowledged the stock tool failure and handled it gracefully.
  ● TC-15  Conflicting Information         ✅ PASS  2/2   8.5s  ttft=1,419ms t3  Used the searched population value in the calculator.

                                                                                               Category Breakdown
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Category                                                               ┃            Score             ┃ Bar                                                                    ┃           Earned            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Tool Selection                                                         │             100%             │ ████████████████████                                                   │             6/6             │
│ Parameter Precision                                                    │             100%             │ ████████████████████                                                   │             6/6             │
│ Multi-Step Chains                                                      │             100%             │ ████████████████████                                                   │             6/6             │
│ Restraint & Refusal                                                    │             100%             │ ████████████████████                                                   │             6/6             │
│ Error Recovery                                                         │             100%             │ ████████████████████                                                   │             6/6             │
└────────────────────────────────────────────────────────────────────────┴──────────────────────────────┴────────────────────────────────────────────────────────────────────────┴─────────────────────────────┘

╭─────────────────────────────────────────────────────────────────────────────────────────── 🏆 Benchmark Complete ────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                                                                                                                                              │
│    Model:  deepseek-ai/DeepSeek-V4-Flash                                                                                                                                                                     │
│    Score:  100 / 100                                                                                                                                                                                         │
│    Rating: ★★★★★ Excellent                                                                                                                                                                                   │
│    Engine:       vLLM 0.21.1rc1.dev69+gc92696943.d20260517                                                                                                                                                   │
│    Max context:  262,144 tokens                                                                                                                                                                              │
│                                                                                                                                                                                                              │
│    ✅ 15 passed   ⚠️  0 partial   ❌ 0 failed                                                                                                                                                                │
│    Points: 30/30                                                                                                                                                                                             │
│                                                                                                                                                                                                              │
│    Quality:        100/100                                                                                                                                                                                   │
│    Responsiveness: 41/100  (median turn: 3.8s)                                                                                                                                                               │
│    Deployability:  82/100  (α=0.7)                                                                                                                                                                           │
│                                                                                                                                                                                                              │
│    Completed in 159.1s  │  tool-eval-bench v1.7.0                                                                                                                                                            │
│                                                                                                                                                                                                              │
│    📊 Token Usage:                                                                                                                                                                                           │
│    Total: 40,571 tokens  │  Efficiency: 0.7 pts/1K tokens                                                                                                                                                    │
│                                                                                                                                                                                                              │
│    ── How this score is calculated ──                                                                                                                                                                        │
│    • Each scenario: pass=2pt, partial=1pt, fail=0pt                                                                                                                                                          │
│    • Category %: earned / max per category                                                                                                                                                                   │
│    • Final score: (total points / max points) × 100                                                                                                                                                          │
│    • Deployability: 0.7×quality + 0.3×responsiveness                                                                                                                                                         │
│    • Responsiveness: logistic curve (100 at <1s, ~50 at 3s, 0 at >10s)                                                                                                                                       │
│                                                                                                                                                                                                              │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
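
The scoring formulas printed in the summary can be checked against the reported numbers. A minimal sketch, with function names made up for illustration and inputs taken from the output above (30/30 points, quality 100, responsiveness 41):

```python
def final_score(points: int, max_points: int) -> float:
    """(total points / max points) x 100, as stated in the summary."""
    return points / max_points * 100


def deployability(quality: float, responsiveness: float) -> float:
    """0.7 x quality + 0.3 x responsiveness, as stated in the summary."""
    return 0.7 * quality + 0.3 * responsiveness


print(final_score(30, 30))
print(round(deployability(100, 41)))
```

Both values agree with the report: a final score of 100 and a deployability of 82 after rounding.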

I submitted a recipe to Spark Arena; it should update soon. This is my first time doing a lot of this. I’m new to all this and was having major issues getting the CLI to work, so I went the llama-benchy route. The first test ran for 3 hours and would not produce results, so we had to trim it. It took about an hour, but I got results. I’ll post in here when I see it on the site.