DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers

paxren2020 · May 30, 2026, 7:52pm

Could you please guide me on how to properly build the image? I’m a bit confused. =(

wolttam · May 30, 2026, 8:50pm

git clone https://github.com/eugr/spark-vllm-docker
cd spark-vllm-docker
curl -L https://gist.githubusercontent.com/mlow/fc04615043b4cb9938e7be5719aa6aca/raw/d581c5db04d5ac534fe978ed371b289107dae231/deepseek-v4-flash.yaml -o recipes/deepseek-v4-flash.yaml
./run-recipe.py deepseek-v4-flash --no-ray -n <node-a-IP>,<node-b-IP> --setup

You can play with the flags to run-recipe.py to force rebuild the image and such

tonyd615 · May 30, 2026, 10:24pm

What did the update to VLLM do today to the recipe if anything ?

peter99512 · May 31, 2026, 1:29am

Thank you for the website! this is very well documented and detailed. 40t/s is already a good start to consider setup and running. Is this stable to run with continues agentic work now?

susni · May 31, 2026, 1:37am

yes and no, it’s performance is great now but for some reason it likes to hold onto old context and it fills up the GPU KV cache quite fast with MTP enabled.

peter99512 · May 31, 2026, 1:38am

I see, good for testing but need to wait more time to address these issues.

susni · May 31, 2026, 2:19am

a watcher to occasionally restart the server and good context management might hold you over until it’s fixed

ekkis · May 31, 2026, 6:38am

There’s definitely something buggy with how the kv cache is kept and prefix cache is sometimes invalidated for this model in vllm, but apart from that I’m really enjoying it.

I rebuilt the image today and for some reason got a massive kv cache boost, up to 4x concurrency now at 300k context from 1.9x before. This model doesn’t suffer from context rot like Minimax M2.7 does and stays effective at 200k context and beyond, I went to about 270k yesterday without issue. It’s also very thorough and builds good plans and executes them well. It will stay as my daily driver for now, but hoping Minimax M3 might replace it next week.

corbett_korbett · May 31, 2026, 7:01am

How is it comparing to minimax overall? How does the code quality and most importantly general world knowledge compare to minimax? Really looking for a model to run as my daily on dual sparks that can actually handle AI model training questions and code as well as general questions well.

tonyd615 · May 31, 2026, 8:22am

can you publish image ?

serapis · May 31, 2026, 9:04am

I also noticed model instability and random crashes.

I couldn’t put my finger on what causes these issues given they seem to occur at random.

paxren2020 · May 31, 2026, 9:11am

This is most likely related to the changes in @jasl9187 's PR. Judging by the PR history, there were a lot of edits made yesterday and throughout the week.

We can actually build the image straight from the official community base image by applying just that single PR.

./build-and-copy.sh -t vllm-node-220-1-41834-ds4 --apply-vllm-pr 41834 --rebuild-vllm --cleanup -c

github.com/vllm-project/vllm

[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes (#41834)

main ← jasl:codex/ds4-sm120-min-enable

opened 03:17PM - 06 May 26 UTC

jasl

+12103 -302

## Latest validation snapshot (2026-05-29) Hardware: 2x NVIDIA RTX PRO 6000 Bla…ckwell Workstation Edition (SM120). PR head: [`0236f021d`](https://github.com/jasl/vllm/commit/0236f021d7f6fd3b1af290c153547a11659e08a1). Baseline labels: `20260529_mhc_fix_pr_perf_gate_baseline_primary`, `20260529_pr43477_temp0_c124_comparison`. Primary serve profile: TP=2, MTP=2, FP8 KV cache, block size 256, max model len 131072, prefix cache disabled for cold-latency baselines, and `FULL_AND_PIECEWISE` CUDA graph compilation. The focused temp=0 random check used max model len 65536. All primary phases exited 0: server startup, 59K/124K latency matrix, frontier context sweep, DS4 story semantic gate, long-context decode concurrency, mixed arrival, streaming pressure, HF mt-bench, GSM8K, random prefill sweep, random 8K/1K, and random 256/256. Runtime error counters stayed at 0 in primary phases. ### Correctness `lm_eval` GSM8K, 5-shot, 200 questions, `temperature=0`: | Variant | exact_match_flexible | exact_match_strict | | --- | ---: | ---: | | MTP=2 | 95.5% | 93.5% | DS4 story recall semantic gate: 30,502 prompt tokens, matched all 16 `Name=number` assignments, TTFT 5.833 s, decode 168.37 tok/s, ITL p99 18 ms. ### HF mt-bench throughput (`temperature=1.0`, MTP=2) | C | Output tok/s | TTFT mean | TPOT mean | MTP acceptance | Acceptance length | | ---: | ---: | ---: | ---: | ---: | ---: | | 1 | 154.26 | 93.83 ms | 5.98 ms | 67.69% | 2.35 | | 2 | 240.38 | 119.40 ms | 7.58 ms | 68.17% | 2.36 | | 4 | 352.57 | 152.77 ms | 10.16 ms | 67.68% | 2.35 | | 8 | 362.16 | 2070.98 ms | 10.17 ms | 68.08% | 2.36 | | 16 | 357.80 | 5767.72 ms | 10.20 ms | 67.22% | 2.34 | | 24 | 362.20 | 8669.22 ms | 10.08 ms | 68.76% | 2.38 | ### Random 8K/1K throughput (`temperature=0.0`, 16 prompts) | Variant | C | Output tok/s | TTFT mean | TPOT mean | MTP acceptance | Acceptance length | | --- | ---: | ---: | ---: | ---: | ---: | ---: | | no-MTP | 1 | 90.51 | 1351.74 ms | 9.71 ms | - | - | | no-MTP | 2 | 142.91 | 2036.42 ms | 11.97 ms | - | - | | no-MTP | 4 | 211.03 | 3334.25 ms | 15.63 ms | - | - | | MTP=2 | 1 | 153.47 | 1380.52 ms | 5.14 ms | 82.56% | 2.65 | | MTP=2 | 2 | 220.51 | 1574.53 ms | 7.39 ms | 86.33% | 2.73 | | MTP=2 | 4 | 275.58 | 2054.14 ms | 11.74 ms | 81.03% | 2.62 | `temperature=1.0` random 8K/1K remains a separate stochastic workload: C=1/2/4/8/16/32 output tok/s = 110.41 / 160.55 / 222.91 / 231.44 / 228.34 / 231.31, with MTP acceptance around 51%. Please compare random throughput only with the sampling temperature included. ### Long-context latency (`temperature=0.0`, prefix cache disabled, MTP=2) | Prompt tokens | C | TTFT mean | TTFT max | Decode mean tok/s | Fairness min/max | ITL p99 | Failures | | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | | 58,980 | 1 | 12.160 s | 12.190 s | 137.05 | 0.959 | 22 ms | 0 | | 58,980 | 2 | 19.272 s | 25.921 s | 82.56 | 0.237 | 90 ms | 0 | | 124,080 | 1 | 30.990 s | 31.030 s | 105.81 | 0.998 | 29 ms | 0 | | 124,080 | 2 | 47.870 s | 63.927 s | 66.41 | 0.286 | 94 ms | 0 | Long-context decode concurrency at 124K: C=1 decode 105.99 tok/s, ITL p99 29 ms; C=2 decode mean 64.34 tok/s, min/max 30.86 / 97.82 tok/s, ITL p99 96 ms, failures 0. ### DS4 prompt-file frontier sweep | Prompt | Target frontier | Prompt tokens | TTFT | Input tok/s | Decode tok/s | ITL p99 | | --- | ---: | ---: | ---: | ---: | ---: | ---: | | security audit | 65,536 | 17,531 | 3.268 s | 5364.86 | 164.39 | 82 ms | | security audit | 98,304 | 27,991 | 5.345 s | 5237.08 | 152.87 | 89 ms | | security audit | 124,000 | 36,649 | 7.232 s | 5067.79 | 118.46 | 23 ms | | story recall | 65,536 | 16,089 | 2.886 s | 5575.63 | 178.39 | 17 ms | | story recall | 98,304 | 24,149 | 4.483 s | 5387.39 | 167.18 | 18 ms | | story recall | 124,000 | 30,478 | 5.867 s | 5194.51 | 168.03 | 18 ms | ### Mixed / continuous pressure Mixed-arrival gates all passed with 0 failures: - `decode_then_59k`: primary TTFT mean 12.613 s, secondary TTFT mean 13.505 s, secondary ITL p99 22 ms. - `decode_then_124k`: primary TTFT mean 31.420 s, secondary TTFT mean 32.255 s, secondary ITL p99 29 ms. - `long_then_short`: primary TTFT mean 31.814 s, secondary TTFT mean 30.378 s, secondary ITL p99 32 ms. Streaming pressure matrix: 4 cases, 36 requests, 0 failures, 0 slow cases, max prompt 100,127 tokens, max TTFT 57.591 s, ITL p95 409 ms, ITL p99 737 ms. Prefix-cache stress with fillers 100 / 400 / 800 / 1600 / 3200 all passed server startup and stress phases. ### Random prefill sweep (`temperature=0.0`, MTP=2) | Input tokens | Input tok/s | TTFT mean | | ---: | ---: | ---: | | 1K | 6159.40 | 165.47 ms | | 4K | 5990.49 | 683.97 ms | | 16K | 5504.91 | 2975.94 ms | | 65K | 4570.95 | 14337.38 ms | ### Known caveats - The current dual RTX PRO 6000 validation host covers the 128K-class range. 256K, 512K, and 1M context claims still need a larger GPU budget / 4-card validation. - Mixed long-context C=2 fairness remains the main optimization target: the 124K decode-concurrency run still has a 30.86 / 97.82 tok/s per-request spread. - Random throughput must always state temperature. At `temperature=1.0`, MTP acceptance is much lower than at `temperature=0.0`, so tok/s is not directly comparable across those settings. --- ## Historical validation notes ## Purpose Enable DeepSeek V4 Flash on SM12x Blackwell consumer hardware (RTX PRO 6000 Workstation Edition, RTX 5090, DGX Spark GB10). The core challenge: SM12x lacks the TMEM / `tcgen05` instructions present on datacenter Blackwell (SM10x), so DeepGEMM, FlashMLA, and Marlin's FP8 paths fail at kernel link time on this hardware. This PR provides pure-PyTorch fallbacks, Triton kernel implementations, and SM12x-specific tuning so the model runs end-to-end with production-quality perf. ## Validation results Hardware: 2x NVIDIA RTX PRO 6000 Blackwell Workstation Edition. PR head: [`3424fba51`](https://github.com/jasl/vllm/commit/3424fba51301504262c3d8355e2560469f18c9c4). Rebased on `upstream/main` 2026-05-19; NCCL: `nvidia-nccl-cu13` 2.30.4. Reference long-context serve config used for the 2026-05-18 run: ```bash vllm serve deepseek-ai/DeepSeek-V4-Flash \ --kv-cache-dtype fp8 \ --block-size 256 \ --max-model-len 131072 \ --tensor-parallel-size 2 \ --gpu-memory-utilization 0.98 \ --max-num-seqs 4 \ --max-num-batched-tokens 4096 \ --tokenizer-mode deepseek_v4 \ --tool-call-parser deepseek_v4 \ --enable-auto-tool-choice \ --reasoning-parser deepseek_v4 \ --reasoning-config '{"reasoning_parser":"deepseek_v4","reasoning_start_str":"<think>","reasoning_end_str":"</think>"}' \ --speculative_config '{"method":"mtp","num_speculative_tokens":2}' \ --no-enable-prefix-caching \ --no-enable-flashinfer-autotune ``` `--no-enable-prefix-caching` is set for these latency baselines so cold prefill numbers are not biased by cache hits. End-user document/chat deployments should generally keep prefix caching enabled. ### Accuracy `lm_eval` `gsm8k` 5-shot, 200 questions, `temperature=0`, `max_gen_toks=2048`, via `/v1/completions`: | Variant | strict-match | | --- | ---: | | no-MTP | 95.5% | | MTP=2 | 95.0% | Within the historical 0.948-0.965 band on this model. ### Performance (mt-bench, `philschmid/mt-bench`, 80 prompts) | c | no-MTP TPOT (ms) | no-MTP tok/s | no-MTP TTFT med (ms) | MTP=2 TPOT (ms) | MTP=2 tok/s | MTP=2 TTFT med (ms) | | ---: | ---: | ---: | ---: | ---: | ---: | ---: | | 1 | 9.9 | 98 | 53 | **5.7** | **165** | 59 | | 2 | 11.7 | 163 | 72 | 7.5 | 248 | 84 | | 4 | 14.4 | 266 | 77 | 11.8 | 316 | 100 | | 8 | 19.4 | 380 | 86 | 13.7 | 530 | 117 | | 16 | 27.3 | 520 | 120 | 19.3 | 721 | 161 | | 24 | 32.6 | 607 | 391 | **23.2** | **846** | 194 | MTP=2 peak: **165 tok/s single-stream**, **846 tok/s @ c=24**. MTP=2 acceptance length 2.35-2.38 on real-content prompts, pos-0 acceptance 84-85%. ### Long-context prefill Earlier long-context work in this PR added `_accumulate_indexed_attention_chunk_multihead_kernel` (HEAD_BLOCK=8) and overlapped the C128A prefill KV gather with the indexer forward. The latest two commits add a direct SM120 MQA top-k fallback path: Triton materializes the FP8 MQA logits, then the existing custom `top_k_per_row_prefill` op selects top-k row indices without the slower PyTorch per-chunk score path. Dedicated 128K A/B sweep on the same 2x RTX PRO 6000 setup, C=1, cold, `max_tokens=64`: | Build | 127,056-token TTFT mean | Delta vs parent | | --- | ---: | ---: | | Before direct MQA top-k fallback | 60.83 s | - | | Triton MQA logits + PyTorch top-k (`f32b9e782`) | 37.65 s | -38.1% | | Triton MQA logits + custom row top-k (`709f50d10`) | 36.87 s | -2.1% | Conservative full-validation rerun on 2026-05-18, C=1, cold, `max_tokens=64`, repeat=3: | Prompt tokens | C | TTFT mean | TTFT max | Elapsed mean | Failures | | ---: | ---: | ---: | ---: | ---: | ---: | | 63,568 | 1 | 14.71 s | 14.84 s | 15.16 s | 0 / 3 | | 127,056 | 1 | 38.23 s | 38.38 s | 38.84 s | 0 / 3 | That conservative rerun is **37.1% lower TTFT** than the pre-top-k 128K baseline (60.83 s -> 38.23 s), a **1.59x** speedup. Small-concurrency long-context matrix on 2026-05-18, cold salted prompts, `max_tokens=128`, repeat=2, prefix cache disabled: | Prompt tokens | C | Requests | TTFT mean | TTFT max | Elapsed mean | Failures | | ---: | ---: | ---: | ---: | ---: | ---: | ---: | | 63,568 | 1 | 2 | 14.96 s | 15.05 s | 15.45 s | 0 | | 63,568 | 2 | 4 | 23.45 s | 32.18 s | 31.61 s | 0 | | 63,568 | 4 | 8 | 38.96 s | 63.83 s | 55.67 s | 0 | | 127,056 | 1 | 2 | 39.93 s | 41.66 s | 40.59 s | 0 | | 127,056 | 2 | 4 | 58.68 s | 80.52 s | 69.53 s | 0 | | 127,056 | 4 | 8 | 99.10 s | 162.16 s | 122.25 s | 0 | Short-context warmed regression check, 4,047 prompt tokens, cold salted prompts, `max_tokens=64`, repeat=2: | C | TTFT mean | TTFT max | Failures | | ---: | ---: | ---: | ---: | | 1 | 0.643 s | 0.646 s | 0 / 2 | | 2 | 1.019 s | 1.520 s | 0 / 4 | | 4 | 1.724 s | 3.096 s | 0 / 8 | No short-context regression versus the same-machine pre-top-k baseline (4,047 prompt tokens: C=1 0.689 s, C=2 1.125 s, C=4 2.072 s TTFT mean). ### Post-rebase MTP C=4 stability and 64K/128K gate (2026-05-19) After the upstream DeepSeek V4 refactor rebase, a short-context MTP C=4 stability blocker was traced to DeepSeek V4 MTP full decode CUDA graph replay. The current branch skips full decode CUDA graph capture for DeepSeek V4 MTP while keeping PIECEWISE CUDA graphs, and caps DeepSeek V4 MTP warmup / dummy sampler request shapes at 32. Serve logs for this run show `PIECEWISE=49` and no full decode graph capture. Short-context MTP matrix, prefix cache disabled, 131K max-model-len, 4096 max-num-batched-tokens, TP=2, 16 prompts: | C | Successful requests | Output tok/s | Mean TTFT | MTP acceptance | | ---: | ---: | ---: | ---: | ---: | | 1 | 16 / 16 | 65.01 | 87.88 ms | 64.16% | | 2 | 16 / 16 | 127.86 | 194.96 ms | 63.39% | | 4 | 16 / 16 | 225.48 | 254.46 ms | 64.85% | Full long-context promotion gate, prefix cache disabled, cold prompts, `max_tokens=128`, repeat=3: | Prompt tokens | C | Requests | TTFT mean | TTFT max | Failures | | ---: | ---: | ---: | ---: | ---: | ---: | | 62,080 | 1 | 3 | 13.009 s | 13.036 s | 0 | | 62,080 | 2 | 6 | 20.370 s | 26.906 s | 0 | | 62,080 | 3 | 9 | 27.672 s | 41.810 s | 0 | | 62,080 | 4 | 12 | 34.554 s | 54.625 s | 0 | | 124,080 | 1 | 3 | 32.779 s | 32.797 s | 0 | | 124,080 | 2 | 6 | 49.830 s | 67.093 s | 0 | | 124,080 | 3 | 9 | 66.912 s | 104.247 s | 0 | | 124,080 | 4 | 12 | 84.197 s | 138.497 s | 0 | GSM8K limit-200, 5-shot, MTP concurrency 1: `exact_match_flexible=0.960`, `exact_match_strict=0.955`. Targeted regression tests for this fix: ```bash python -m pytest tests/v1/worker/test_gpu_model_runner.py::test_deepseek_v4_mtp_dummy_sampler_warmup_caps_large_max_num_seqs tests/v1/worker/test_gpu_model_runner.py::test_dummy_sampler_warmup_does_not_cap_other_mtp_models tests/model_executor/test_deepseek_v4_kernel_warmup.py::test_deepseek_v4_mtp_uniform_decode_warmup_caps_large_max_num_seqs tests/v1/cudagraph/test_cudagraph_dispatch.py::TestCudagraphDispatcher::test_deepseek_v4_mtp_spec_decode_skips_full_decode_graphs -q ``` Result: `4 passed, 16 warnings`. ### Acceptance (toolcall-15 scenario battery) | Variant | score | failures | | --- | ---: | ---: | | no-MTP | 91% | 13 / 135 cases | | MTP=2 | 92% | 12 / 135 cases | This is the first SM12x baseline that evaluates thinking-mode correctly. Two prior harness bugs masked thinking-mode entirely across every earlier retry: 1. The harness was sending `extra_body.thinking={"type":"enabled"}` at the top level, which is the Claude API shape. vLLM's DSv4 chat-template entry reads `chat_template_kwargs.thinking` instead, so every request silently routed to chat mode. Fixed by 323aa1f (confirmed in this PR discussion by qym-ll). 2. The transcript / replay path read `message.reasoning_content`, but this vLLM OpenAI frontend build populates `message.reasoning`. The harness now normalizes both keys. The remaining failures stay concentrated in `TC-06` (Multi-Value Extraction, 7/7 across modes) plus scattered TC-11 / TC-14 / TC-15: characteristic helpfulness-bias / deflect-rather-than-refuse model behaviours, not SM12x regressions. ### Comparison to DeepSeek's official hosted API Same prompts run against `api.deepseek.com/v1/chat/completions` with `model=deepseek-v4-flash`, same `temperature=1.0 top_p=1.0`, and the same thinking-mode shape: | Source | toolcall-15 score | failures / cases | | --- | ---: | ---: | | DeepSeek hosted API | 96% | 2 / 45 (1 round) | | This PR, MTP=2 | 92% | 12 / 135 (3 rounds) | | This PR, no-MTP | 91% | 13 / 135 (3 rounds) | Per-case failure rate: hosted 4.4%, this PR 8.9-9.6%. The hosted service either ships a checkpoint we have not pulled from the HF release, or injects an internal tool-use system prompt. Either way the local vs hosted gap on this PR is the smallest it has been in any baseline shipped here. ### vs 2026-05-12 deployment baseline ([`1c20f1a6d`](https://github.com/jasl/vllm/commit/1c20f1a6d), same hardware) | Metric | 2026-05-12 | 2026-05-17/18 (this PR) | Delta | | --- | ---: | ---: | ---: | | no-MTP mt-bench c=1 tok/s | 89 | 98 | **+10%** | | MTP=2 mt-bench c=1 tok/s | 137 | 165 | **+20%** | | no-MTP mt-bench c=24 tok/s | 557 | 607 | **+9%** | | MTP=2 mt-bench c=24 tok/s | 706 | 846 | **+20%** | | 128K cold C=1 TTFT mean | 60.83 s | 38.23 s | **-37.1%** | ## Verification commands ```bash ruff check vllm/v1/attention/ops/deepseek_v4_ops/sm12x_deep_gemm_fallbacks.py tests/v1/attention/test_sm120_deepgemm_fallbacks.py python -m py_compile vllm/v1/attention/ops/deepseek_v4_ops/sm12x_deep_gemm_fallbacks.py tests/v1/attention/test_sm120_deepgemm_fallbacks.py python -m pytest tests/v1/attention/test_sm120_deepgemm_fallbacks.py -q ``` Result: `4 passed, 16 warnings`. Long-context matrix verification: ```bash # 64K / 128K, C=1/2/4, cold, prefix cache disabled scripts/run_long_context_latency_matrix.sh # 4K warmed short-context regression, C=1/2/4 scripts/run_long_context_latency_matrix.sh ``` Results: 64K/128K matrix `PASS`, 6 groups, 0 failures; warmed short-context matrix `PASS`, 3 groups, 0 failures. ## Known caveats - **MTP=1 NCCL allgather hang** under sustained multi-stream load was reproduced once in earlier baselines at c=4 mid-bench. This is outside the SM12x fallback patch surface (Torch NCCL `ProcessGroupWatchdog`) and MTP=1 remains smoke-tier pending repro on NCCL 2.30.4+. - **MTP=3 demoted to smoke-tier**: net slightly slower than MTP=2 at every c measured so far. Worth re-checking if upstream MTP draft kernels become cheaper per K. - **Prefix caching disabled** in the reference cold-prefill numbers above. The locally cherry-picked `vllm-project/vllm#42784` fix means prefix cache does work on DSv4 SWA when enabled; a cache-on companion run is still useful for real document-chat deployment. - **Context limit of this validation host**: the current dual RTX PRO 6000 setup can validate up to ~131K model length. 256K, 512K, and 1M scenarios still need larger GPU count / KV budget validation. ## Acknowledgments - @alexbi29 contributed three improvements landed in this revision: - **Multi-head prefill accumulate kernel** (`_accumulate_indexed_attention_chunk_multihead_kernel`, HEAD_BLOCK=8), patterned after the existing decode `_finish_materialized_scores_with_sink_kernel`. - The SWA `_cache_block_mask` over-aggression for Eagle/MTP groups, fixed by `vllm-project/vllm#42784` (cherry-picked locally pending upstream merge). - The `_deepseek_v4_sm12x_fp8_einsum_kernel` autotune key including `num_tokens`, causing per-request 4-config re-benchmarks; we pinned the winning config and removed the decorator. - @aabbccddwasd contributed the **C128A prefill KV gather overlap** with the indexer (`_aux_stream[1]` overlap of `dequantize_and_gather_k_cache` with `indexer.forward`). - @aabbccddwasd's PR-comment suggestions also led to the per-token early-exit on sparse MLA accumulate, the C128A top-k metadata loop cap at `effective_topk`, and the multi-head prefill kernel direction. - @infernix pointed to the fast-prefill autoresearch branch and DeepGEMM SM120 work in [this PR discussion](https://github.com/vllm-project/vllm/pull/41834#issuecomment-4476480477). The latest direct MQA top-k fallback commits were evaluated from that lead and keep the effective path in this PR branch. ## AI assistance disclosure Claude (Anthropic), GPT-5.4, and GPT-5.5 were used for code review, refactoring, regression-script writing, and benchmark analysis. All kernel logic and architectural decisions were validated by human review and end-to-end benchmarks before each push.

ekkis · May 31, 2026, 10:32am

I haven’t had any crashes yet, the only clear issue I have is very occasionally I get a tool call error with output in Opencode like this:

     <｜DSML｜tool                                                                                                                                                                                                                           
                                                                                                                                                                                                                                             
     _calls>                                                                                                                                                                                                                                 
     <｜DSML｜invoke name="read">                                                                                                                                                                                                            
     <｜DSML｜parameter name="offset" string="false">1340</｜DSML｜parameter>                                                                                                                                                                
     <｜DSML｜parameter name="filePath" string="true">path to file redacted</｜DSML｜parameter>                                                                                                
     <｜DSML｜parameter name="limit" string="false">120</｜DSML｜parameter>                                                                                                                                                                  
     </｜DSML｜invoke>                                                                                                                                                                                                                       
     </｜DSML｜tool_calls>

It doesn’t happen often enough to be really annoying though.

baixu.yan · May 31, 2026, 10:38am

I run these command step by step with --force-rebuild in last command，and encounter this error. Any clue about what and why? I really want to experience 40t/s，but by now every try failed，which drived me crazy.
I will really appreciate if someone can give me an end-to-end solution to deploy the amazing solutions you are talking about ~

paxren2020 · May 31, 2026, 10:43am

Try doing it this way:

docker builder prune
./build-and-copy.sh -t vllm-node-220-1-41834-ds4 --apply-vllm-pr 41834 --rebuild-vllm --cleanup -c
VLLM_SPARK_EXTRA_DOCKER_ARGS=“-v $HOME/DATA/hf/models/:/models” ./run-recipe.py deepseek-v4-flash --no-ray

And then use the recipe to launch it (just tweak it slightly to fit your setup).

recipe

recipe_version: "1"
name: DeepSeek-V4-Flash
description: DeepSeek V4 Flash FP8 on dual DGX Spark TP=2 with PR 41834 SM12x support
model: deepseek-ai/DeepSeek-V4-Flash
container: vllm-node-220-1-41834-ds4
cluster_only: true

build_args:
  - --apply-vllm-pr
  - "41834"
  - --rebuild-vllm

mods:
#  - mods/fix-ds4-gpu-cache
  - mods/drop-caches

defaults:
  port: 8888
  host: 0.0.0.0
  tensor_parallel: 2
  pipeline_parallel: 1
  gpu_memory_utilization: 0.90
  max_model_len: 262144
  max_num_batched_tokens: 6144  # 8192 #16384  # 4192
  max_num_seqs: 8
  block_size: 256
  served_model_name: my-ds4

env:
  TORCH_CUDA_ARCH_LIST: 12.1a
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: 1
  VLLM_TRITON_MLA_SPARSE: 1
  FLASHINFER_DISABLE_VERSION_CHECK: 1
  TILELANG_CLEANUP_TEMP_FILES: 1
  DG_JIT_USE_NVRTC: 0
  DG_JIT_NVCC_COMPILER: /usr/local/cuda/bin/nvcc
  DG_JIT_PRINT_COMPILER_COMMAND: 1
  NCCL_IB_DISABLE: 0
  NCCL_DEBUG: WARN
  OMP_NUM_THREADS: 4

command: |
  vllm serve \
      --model /models/deepseek-ai/DeepSeek-V4-Flash \
      --served-model-name {served_model_name} \
      --host {host} \
      --port {port} \
      --trust-remote-code \
      --tensor-parallel-size {tensor_parallel} \
      --pipeline-parallel-size {pipeline_parallel} \
      --kv-cache-dtype fp8 \
      --block-size {block_size} \
      --enable-prefix-caching \
      --max-model-len {max_model_len} \
      --max-num-seqs {max_num_seqs} \
      --max-num-batched-tokens {max_num_batched_tokens} \
      --gpu-memory-utilization {gpu_memory_utilization} \
      --distributed-executor-backend mp \
      --compilation-config '{{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}}' \
      --tokenizer-mode deepseek_v4 \
      --tool-call-parser deepseek_v4 \
      --enable-auto-tool-choice \
      --reasoning-parser deepseek_v4 \
      --reasoning-config '{{"reasoning_parser":"deepseek_v4","reasoning_start_str":"<think>","reasoning_end_str":"</think>"}}' \
      --default-chat-template-kwargs '{{"thinking":true}}' \
      --load-format safetensors


#      --speculative-config '{{"method":"mtp","num_speculative_tokens":2}}' \

baixu.yan · May 31, 2026, 11:02am

Thanks. I’m trying this new version. The build is on process, probably about 1.5 hours. God bless me, wish I could make it done tonight.

paxren2020 · May 31, 2026, 11:41am

I also catch this error sometimes.

`>` `<｜DSML｜invoke name="bash">` `<｜DSML｜parameter name="description" string="true">Check binary dates and debug</｜DSML｜parameter>` `<｜DSML｜parameter name="command" string="true">ls -la /workspace/cgraph2dot && stat /workspace/cgraph2dot</｜DSML｜parameter>` `</｜DSML｜invoke>` `</｜DSML｜tool_calls>`

wolttam · May 31, 2026, 1:16pm

This is a known issue with nvidia’s cutlass library on 4.5.x, due to a race condition.

You can fix the issue by additng this as the last line in spark-vllm-docker/Dockerfile:

RUN uv pip install --force-reinstall --no-deps nvidia-cutlass-dsl-libs-cu13==4.5.2

0rand · May 31, 2026, 5:25pm

Thanks to your work I have DeepSeek v4 Flash up and running today on my 2-Spark Ray cluster. Very impressive - top scores in tool-call comparing to my existing champions (Nemo 3 Super, Qwen 3.5 122B, Mistral 4 Small 119B), very very solid speed (~37 t/s stable), excellent drafter. Running with 1M context and MTP, big bench - totally sold.

PS: started building the image and wheels, then found out your pre-made image from 2 days ago, pulled and voila. Easy as pie. Thank you!

baixu.yan · June 1, 2026, 3:23am

It works. Thank you~

Topic		Replies	Views
Deepseek v4 Flash on 2 Nodes DGX Spark / GB10 Projects deepseek	47	4168	May 31, 2026
Deepseek V4 released DGX Spark / GB10 deepseek	143	15000	May 18, 2026
DeepSeekV4-Flash hybrid quant, 1x DGX Spark: antirez's optimized 128 GB MLX recipe ported to vLLM for GB10 DGX Spark / GB10 Projects deepseek	18	1609	May 11, 2026
Fully custom CUDA-native Deepseek 4 Flash optimized for 1x Spark! antirez/ds4 DGX Spark / GB10 Projects gaming , llama , deepseek	65	5150	May 30, 2026
DeepSeek V4 Flash: Bringing Frontier AI to the Home DGX Spark / GB10 deepseek	11	2043	May 17, 2026
Anyone having luck with Deepseek V4 Flash on Dual Sparks? DGX Spark / GB10 deepseek	13	1107	June 4, 2026
We unlocked NVFP4 on the DGX Spark: 20% faster than AWQ! DGX Spark / GB10	145	8292	March 28, 2026
GLM-4.7-Flash-NVFP4 was just released, but for Transformers 5.0 + vLLM 0.14...? DGX Spark / GB10	90	4473	February 27, 2026
Help: Running NVFP4 model on 2x DGX Spark with vLLM + Ray (multi-node) DGX Spark / GB10 mistral-large	18	2536	December 25, 2025
Does Qwen3.5-35B-A3B on GB10 leave a lot of performance on the table? DGX Spark / GB10 agentic-ai	40	5764	March 16, 2026

DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers

Related topics