## Latest validation snapshot (2026-05-29)
Hardware: 2x NVIDIA RTX PRO 6000 Bla…ckwell Workstation Edition (SM120).
PR head: [`0236f021d`](https://github.com/jasl/vllm/commit/0236f021d7f6fd3b1af290c153547a11659e08a1).
Baseline labels: `20260529_mhc_fix_pr_perf_gate_baseline_primary`, `20260529_pr43477_temp0_c124_comparison`.
Primary serve profile: TP=2, MTP=2, FP8 KV cache, block size 256, max model len 131072, prefix cache disabled for cold-latency baselines, and `FULL_AND_PIECEWISE` CUDA graph compilation. The focused temp=0 random check used max model len 65536.
All primary phases exited 0: server startup, 59K/124K latency matrix, frontier context sweep, DS4 story semantic gate, long-context decode concurrency, mixed arrival, streaming pressure, HF mt-bench, GSM8K, random prefill sweep, random 8K/1K, and random 256/256. Runtime error counters stayed at 0 in primary phases.
### Correctness
`lm_eval` GSM8K, 5-shot, 200 questions, `temperature=0`:
| Variant | exact_match_flexible | exact_match_strict |
| --- | ---: | ---: |
| MTP=2 | 95.5% | 93.5% |
DS4 story recall semantic gate: 30,502 prompt tokens, matched all 16 `Name=number` assignments, TTFT 5.833 s, decode 168.37 tok/s, ITL p99 18 ms.
### HF mt-bench throughput (`temperature=1.0`, MTP=2)
| C | Output tok/s | TTFT mean | TPOT mean | MTP acceptance | Acceptance length |
| ---: | ---: | ---: | ---: | ---: | ---: |
| 1 | 154.26 | 93.83 ms | 5.98 ms | 67.69% | 2.35 |
| 2 | 240.38 | 119.40 ms | 7.58 ms | 68.17% | 2.36 |
| 4 | 352.57 | 152.77 ms | 10.16 ms | 67.68% | 2.35 |
| 8 | 362.16 | 2070.98 ms | 10.17 ms | 68.08% | 2.36 |
| 16 | 357.80 | 5767.72 ms | 10.20 ms | 67.22% | 2.34 |
| 24 | 362.20 | 8669.22 ms | 10.08 ms | 68.76% | 2.38 |
### Random 8K/1K throughput (`temperature=0.0`, 16 prompts)
| Variant | C | Output tok/s | TTFT mean | TPOT mean | MTP acceptance | Acceptance length |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| no-MTP | 1 | 90.51 | 1351.74 ms | 9.71 ms | - | - |
| no-MTP | 2 | 142.91 | 2036.42 ms | 11.97 ms | - | - |
| no-MTP | 4 | 211.03 | 3334.25 ms | 15.63 ms | - | - |
| MTP=2 | 1 | 153.47 | 1380.52 ms | 5.14 ms | 82.56% | 2.65 |
| MTP=2 | 2 | 220.51 | 1574.53 ms | 7.39 ms | 86.33% | 2.73 |
| MTP=2 | 4 | 275.58 | 2054.14 ms | 11.74 ms | 81.03% | 2.62 |
`temperature=1.0` random 8K/1K remains a separate stochastic workload: C=1/2/4/8/16/32 output tok/s = 110.41 / 160.55 / 222.91 / 231.44 / 228.34 / 231.31, with MTP acceptance around 51%. Please compare random throughput only with the sampling temperature included.
### Long-context latency (`temperature=0.0`, prefix cache disabled, MTP=2)
| Prompt tokens | C | TTFT mean | TTFT max | Decode mean tok/s | Fairness min/max | ITL p99 | Failures |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 58,980 | 1 | 12.160 s | 12.190 s | 137.05 | 0.959 | 22 ms | 0 |
| 58,980 | 2 | 19.272 s | 25.921 s | 82.56 | 0.237 | 90 ms | 0 |
| 124,080 | 1 | 30.990 s | 31.030 s | 105.81 | 0.998 | 29 ms | 0 |
| 124,080 | 2 | 47.870 s | 63.927 s | 66.41 | 0.286 | 94 ms | 0 |
Long-context decode concurrency at 124K: C=1 decode 105.99 tok/s, ITL p99 29 ms; C=2 decode mean 64.34 tok/s, min/max 30.86 / 97.82 tok/s, ITL p99 96 ms, failures 0.
### DS4 prompt-file frontier sweep
| Prompt | Target frontier | Prompt tokens | TTFT | Input tok/s | Decode tok/s | ITL p99 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| security audit | 65,536 | 17,531 | 3.268 s | 5364.86 | 164.39 | 82 ms |
| security audit | 98,304 | 27,991 | 5.345 s | 5237.08 | 152.87 | 89 ms |
| security audit | 124,000 | 36,649 | 7.232 s | 5067.79 | 118.46 | 23 ms |
| story recall | 65,536 | 16,089 | 2.886 s | 5575.63 | 178.39 | 17 ms |
| story recall | 98,304 | 24,149 | 4.483 s | 5387.39 | 167.18 | 18 ms |
| story recall | 124,000 | 30,478 | 5.867 s | 5194.51 | 168.03 | 18 ms |
### Mixed / continuous pressure
Mixed-arrival gates all passed with 0 failures:
- `decode_then_59k`: primary TTFT mean 12.613 s, secondary TTFT mean 13.505 s, secondary ITL p99 22 ms.
- `decode_then_124k`: primary TTFT mean 31.420 s, secondary TTFT mean 32.255 s, secondary ITL p99 29 ms.
- `long_then_short`: primary TTFT mean 31.814 s, secondary TTFT mean 30.378 s, secondary ITL p99 32 ms.
Streaming pressure matrix: 4 cases, 36 requests, 0 failures, 0 slow cases, max prompt 100,127 tokens, max TTFT 57.591 s, ITL p95 409 ms, ITL p99 737 ms.
Prefix-cache stress with fillers 100 / 400 / 800 / 1600 / 3200 all passed server startup and stress phases.
### Random prefill sweep (`temperature=0.0`, MTP=2)
| Input tokens | Input tok/s | TTFT mean |
| ---: | ---: | ---: |
| 1K | 6159.40 | 165.47 ms |
| 4K | 5990.49 | 683.97 ms |
| 16K | 5504.91 | 2975.94 ms |
| 65K | 4570.95 | 14337.38 ms |
### Known caveats
- The current dual RTX PRO 6000 validation host covers the 128K-class range. 256K, 512K, and 1M context claims still need a larger GPU budget / 4-card validation.
- Mixed long-context C=2 fairness remains the main optimization target: the 124K decode-concurrency run still has a 30.86 / 97.82 tok/s per-request spread.
- Random throughput must always state temperature. At `temperature=1.0`, MTP acceptance is much lower than at `temperature=0.0`, so tok/s is not directly comparable across those settings.
---
## Historical validation notes
## Purpose
Enable DeepSeek V4 Flash on SM12x Blackwell consumer hardware (RTX PRO 6000 Workstation Edition, RTX 5090, DGX Spark GB10).
The core challenge: SM12x lacks the TMEM / `tcgen05` instructions present on datacenter Blackwell (SM10x), so DeepGEMM, FlashMLA, and Marlin's FP8 paths fail at kernel link time on this hardware. This PR provides pure-PyTorch fallbacks, Triton kernel implementations, and SM12x-specific tuning so the model runs end-to-end with production-quality perf.
## Validation results
Hardware: 2x NVIDIA RTX PRO 6000 Blackwell Workstation Edition. PR head: [`3424fba51`](https://github.com/jasl/vllm/commit/3424fba51301504262c3d8355e2560469f18c9c4). Rebased on `upstream/main` 2026-05-19; NCCL: `nvidia-nccl-cu13` 2.30.4.
Reference long-context serve config used for the 2026-05-18 run:
```bash
vllm serve deepseek-ai/DeepSeek-V4-Flash \
--kv-cache-dtype fp8 \
--block-size 256 \
--max-model-len 131072 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.98 \
--max-num-seqs 4 \
--max-num-batched-tokens 4096 \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--reasoning-config '{"reasoning_parser":"deepseek_v4","reasoning_start_str":"<think>","reasoning_end_str":"</think>"}' \
--speculative_config '{"method":"mtp","num_speculative_tokens":2}' \
--no-enable-prefix-caching \
--no-enable-flashinfer-autotune
```
`--no-enable-prefix-caching` is set for these latency baselines so cold prefill numbers are not biased by cache hits. End-user document/chat deployments should generally keep prefix caching enabled.
### Accuracy
`lm_eval` `gsm8k` 5-shot, 200 questions, `temperature=0`, `max_gen_toks=2048`, via `/v1/completions`:
| Variant | strict-match |
| --- | ---: |
| no-MTP | 95.5% |
| MTP=2 | 95.0% |
Within the historical 0.948-0.965 band on this model.
### Performance (mt-bench, `philschmid/mt-bench`, 80 prompts)
| c | no-MTP TPOT (ms) | no-MTP tok/s | no-MTP TTFT med (ms) | MTP=2 TPOT (ms) | MTP=2 tok/s | MTP=2 TTFT med (ms) |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 1 | 9.9 | 98 | 53 | **5.7** | **165** | 59 |
| 2 | 11.7 | 163 | 72 | 7.5 | 248 | 84 |
| 4 | 14.4 | 266 | 77 | 11.8 | 316 | 100 |
| 8 | 19.4 | 380 | 86 | 13.7 | 530 | 117 |
| 16 | 27.3 | 520 | 120 | 19.3 | 721 | 161 |
| 24 | 32.6 | 607 | 391 | **23.2** | **846** | 194 |
MTP=2 peak: **165 tok/s single-stream**, **846 tok/s @ c=24**. MTP=2 acceptance length 2.35-2.38 on real-content prompts, pos-0 acceptance 84-85%.
### Long-context prefill
Earlier long-context work in this PR added `_accumulate_indexed_attention_chunk_multihead_kernel` (HEAD_BLOCK=8) and overlapped the C128A prefill KV gather with the indexer forward. The latest two commits add a direct SM120 MQA top-k fallback path: Triton materializes the FP8 MQA logits, then the existing custom `top_k_per_row_prefill` op selects top-k row indices without the slower PyTorch per-chunk score path.
Dedicated 128K A/B sweep on the same 2x RTX PRO 6000 setup, C=1, cold, `max_tokens=64`:
| Build | 127,056-token TTFT mean | Delta vs parent |
| --- | ---: | ---: |
| Before direct MQA top-k fallback | 60.83 s | - |
| Triton MQA logits + PyTorch top-k (`f32b9e782`) | 37.65 s | -38.1% |
| Triton MQA logits + custom row top-k (`709f50d10`) | 36.87 s | -2.1% |
Conservative full-validation rerun on 2026-05-18, C=1, cold, `max_tokens=64`, repeat=3:
| Prompt tokens | C | TTFT mean | TTFT max | Elapsed mean | Failures |
| ---: | ---: | ---: | ---: | ---: | ---: |
| 63,568 | 1 | 14.71 s | 14.84 s | 15.16 s | 0 / 3 |
| 127,056 | 1 | 38.23 s | 38.38 s | 38.84 s | 0 / 3 |
That conservative rerun is **37.1% lower TTFT** than the pre-top-k 128K baseline (60.83 s -> 38.23 s), a **1.59x** speedup.
Small-concurrency long-context matrix on 2026-05-18, cold salted prompts, `max_tokens=128`, repeat=2, prefix cache disabled:
| Prompt tokens | C | Requests | TTFT mean | TTFT max | Elapsed mean | Failures |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 63,568 | 1 | 2 | 14.96 s | 15.05 s | 15.45 s | 0 |
| 63,568 | 2 | 4 | 23.45 s | 32.18 s | 31.61 s | 0 |
| 63,568 | 4 | 8 | 38.96 s | 63.83 s | 55.67 s | 0 |
| 127,056 | 1 | 2 | 39.93 s | 41.66 s | 40.59 s | 0 |
| 127,056 | 2 | 4 | 58.68 s | 80.52 s | 69.53 s | 0 |
| 127,056 | 4 | 8 | 99.10 s | 162.16 s | 122.25 s | 0 |
Short-context warmed regression check, 4,047 prompt tokens, cold salted prompts, `max_tokens=64`, repeat=2:
| C | TTFT mean | TTFT max | Failures |
| ---: | ---: | ---: | ---: |
| 1 | 0.643 s | 0.646 s | 0 / 2 |
| 2 | 1.019 s | 1.520 s | 0 / 4 |
| 4 | 1.724 s | 3.096 s | 0 / 8 |
No short-context regression versus the same-machine pre-top-k baseline (4,047 prompt tokens: C=1 0.689 s, C=2 1.125 s, C=4 2.072 s TTFT mean).
### Post-rebase MTP C=4 stability and 64K/128K gate (2026-05-19)
After the upstream DeepSeek V4 refactor rebase, a short-context MTP C=4 stability blocker was traced to DeepSeek V4 MTP full decode CUDA graph replay. The current branch skips full decode CUDA graph capture for DeepSeek V4 MTP while keeping PIECEWISE CUDA graphs, and caps DeepSeek V4 MTP warmup / dummy sampler request shapes at 32. Serve logs for this run show `PIECEWISE=49` and no full decode graph capture.
Short-context MTP matrix, prefix cache disabled, 131K max-model-len, 4096 max-num-batched-tokens, TP=2, 16 prompts:
| C | Successful requests | Output tok/s | Mean TTFT | MTP acceptance |
| ---: | ---: | ---: | ---: | ---: |
| 1 | 16 / 16 | 65.01 | 87.88 ms | 64.16% |
| 2 | 16 / 16 | 127.86 | 194.96 ms | 63.39% |
| 4 | 16 / 16 | 225.48 | 254.46 ms | 64.85% |
Full long-context promotion gate, prefix cache disabled, cold prompts, `max_tokens=128`, repeat=3:
| Prompt tokens | C | Requests | TTFT mean | TTFT max | Failures |
| ---: | ---: | ---: | ---: | ---: | ---: |
| 62,080 | 1 | 3 | 13.009 s | 13.036 s | 0 |
| 62,080 | 2 | 6 | 20.370 s | 26.906 s | 0 |
| 62,080 | 3 | 9 | 27.672 s | 41.810 s | 0 |
| 62,080 | 4 | 12 | 34.554 s | 54.625 s | 0 |
| 124,080 | 1 | 3 | 32.779 s | 32.797 s | 0 |
| 124,080 | 2 | 6 | 49.830 s | 67.093 s | 0 |
| 124,080 | 3 | 9 | 66.912 s | 104.247 s | 0 |
| 124,080 | 4 | 12 | 84.197 s | 138.497 s | 0 |
GSM8K limit-200, 5-shot, MTP concurrency 1: `exact_match_flexible=0.960`, `exact_match_strict=0.955`.
Targeted regression tests for this fix:
```bash
python -m pytest tests/v1/worker/test_gpu_model_runner.py::test_deepseek_v4_mtp_dummy_sampler_warmup_caps_large_max_num_seqs tests/v1/worker/test_gpu_model_runner.py::test_dummy_sampler_warmup_does_not_cap_other_mtp_models tests/model_executor/test_deepseek_v4_kernel_warmup.py::test_deepseek_v4_mtp_uniform_decode_warmup_caps_large_max_num_seqs tests/v1/cudagraph/test_cudagraph_dispatch.py::TestCudagraphDispatcher::test_deepseek_v4_mtp_spec_decode_skips_full_decode_graphs -q
```
Result: `4 passed, 16 warnings`.
### Acceptance (toolcall-15 scenario battery)
| Variant | score | failures |
| --- | ---: | ---: |
| no-MTP | 91% | 13 / 135 cases |
| MTP=2 | 92% | 12 / 135 cases |
This is the first SM12x baseline that evaluates thinking-mode correctly. Two prior harness bugs masked thinking-mode entirely across every earlier retry:
1. The harness was sending `extra_body.thinking={"type":"enabled"}` at the top level, which is the Claude API shape. vLLM's DSv4 chat-template entry reads `chat_template_kwargs.thinking` instead, so every request silently routed to chat mode. Fixed by 323aa1f (confirmed in this PR discussion by qym-ll).
2. The transcript / replay path read `message.reasoning_content`, but this vLLM OpenAI frontend build populates `message.reasoning`. The harness now normalizes both keys.
The remaining failures stay concentrated in `TC-06` (Multi-Value Extraction, 7/7 across modes) plus scattered TC-11 / TC-14 / TC-15: characteristic helpfulness-bias / deflect-rather-than-refuse model behaviours, not SM12x regressions.
### Comparison to DeepSeek's official hosted API
Same prompts run against `api.deepseek.com/v1/chat/completions` with `model=deepseek-v4-flash`, same `temperature=1.0 top_p=1.0`, and the same thinking-mode shape:
| Source | toolcall-15 score | failures / cases |
| --- | ---: | ---: |
| DeepSeek hosted API | 96% | 2 / 45 (1 round) |
| This PR, MTP=2 | 92% | 12 / 135 (3 rounds) |
| This PR, no-MTP | 91% | 13 / 135 (3 rounds) |
Per-case failure rate: hosted 4.4%, this PR 8.9-9.6%. The hosted service either ships a checkpoint we have not pulled from the HF release, or injects an internal tool-use system prompt. Either way the local vs hosted gap on this PR is the smallest it has been in any baseline shipped here.
### vs 2026-05-12 deployment baseline ([`1c20f1a6d`](https://github.com/jasl/vllm/commit/1c20f1a6d), same hardware)
| Metric | 2026-05-12 | 2026-05-17/18 (this PR) | Delta |
| --- | ---: | ---: | ---: |
| no-MTP mt-bench c=1 tok/s | 89 | 98 | **+10%** |
| MTP=2 mt-bench c=1 tok/s | 137 | 165 | **+20%** |
| no-MTP mt-bench c=24 tok/s | 557 | 607 | **+9%** |
| MTP=2 mt-bench c=24 tok/s | 706 | 846 | **+20%** |
| 128K cold C=1 TTFT mean | 60.83 s | 38.23 s | **-37.1%** |
## Verification commands
```bash
ruff check vllm/v1/attention/ops/deepseek_v4_ops/sm12x_deep_gemm_fallbacks.py tests/v1/attention/test_sm120_deepgemm_fallbacks.py
python -m py_compile vllm/v1/attention/ops/deepseek_v4_ops/sm12x_deep_gemm_fallbacks.py tests/v1/attention/test_sm120_deepgemm_fallbacks.py
python -m pytest tests/v1/attention/test_sm120_deepgemm_fallbacks.py -q
```
Result: `4 passed, 16 warnings`.
Long-context matrix verification:
```bash
# 64K / 128K, C=1/2/4, cold, prefix cache disabled
scripts/run_long_context_latency_matrix.sh
# 4K warmed short-context regression, C=1/2/4
scripts/run_long_context_latency_matrix.sh
```
Results: 64K/128K matrix `PASS`, 6 groups, 0 failures; warmed short-context matrix `PASS`, 3 groups, 0 failures.
## Known caveats
- **MTP=1 NCCL allgather hang** under sustained multi-stream load was reproduced once in earlier baselines at c=4 mid-bench. This is outside the SM12x fallback patch surface (Torch NCCL `ProcessGroupWatchdog`) and MTP=1 remains smoke-tier pending repro on NCCL 2.30.4+.
- **MTP=3 demoted to smoke-tier**: net slightly slower than MTP=2 at every c measured so far. Worth re-checking if upstream MTP draft kernels become cheaper per K.
- **Prefix caching disabled** in the reference cold-prefill numbers above. The locally cherry-picked `vllm-project/vllm#42784` fix means prefix cache does work on DSv4 SWA when enabled; a cache-on companion run is still useful for real document-chat deployment.
- **Context limit of this validation host**: the current dual RTX PRO 6000 setup can validate up to ~131K model length. 256K, 512K, and 1M scenarios still need larger GPU count / KV budget validation.
## Acknowledgments
- @alexbi29 contributed three improvements landed in this revision:
- **Multi-head prefill accumulate kernel** (`_accumulate_indexed_attention_chunk_multihead_kernel`, HEAD_BLOCK=8), patterned after the existing decode `_finish_materialized_scores_with_sink_kernel`.
- The SWA `_cache_block_mask` over-aggression for Eagle/MTP groups, fixed by `vllm-project/vllm#42784` (cherry-picked locally pending upstream merge).
- The `_deepseek_v4_sm12x_fp8_einsum_kernel` autotune key including `num_tokens`, causing per-request 4-config re-benchmarks; we pinned the winning config and removed the decorator.
- @aabbccddwasd contributed the **C128A prefill KV gather overlap** with the indexer (`_aux_stream[1]` overlap of `dequantize_and_gather_k_cache` with `indexer.forward`).
- @aabbccddwasd's PR-comment suggestions also led to the per-token early-exit on sparse MLA accumulate, the C128A top-k metadata loop cap at `effective_topk`, and the multi-head prefill kernel direction.
- @infernix pointed to the fast-prefill autoresearch branch and DeepGEMM SM120 work in [this PR discussion](https://github.com/vllm-project/vllm/pull/41834#issuecomment-4476480477). The latest direct MQA top-k fallback commits were evaluated from that lead and keep the effective path in this PR branch.
## AI assistance disclosure
Claude (Anthropic), GPT-5.4, and GPT-5.5 were used for code review, refactoring, regression-script writing, and benchmark analysis. All kernel logic and architectural decisions were validated by human review and end-to-end benchmarks before each push.