DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers

I think we are all affected by the same problem at the moment, as mentioned in this issue on @jasl9187’s vLLM fork:

I tried to revert to an older working commit, but due to major changes in the repo the old branches no longer exist and the working commits are orphaned.

I also got badly burned when I performed updates. Now I always make backups so that I can return to the working version!

Sorry, my new GB10 is still on the road. I’ll take a look at it ASAP

My new GB10 should arrive on Thursday. If anyone is willing to help, here is a prompt you can give to Claude/Codex/etc. to help debug.

gb10_volunteer_reproduction_agent_prompt.md.txt (7.4 KB)

Thank you all, and sorry for the instability.

Someone shared a patch that makes GB10 work again: "[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes" by jasl (Pull Request #41834, vllm-project/vllm)

The patch is simple: it just clears the GPU cache when loading the model.
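
For anyone wondering what "clear the GPU cache" looks like in practice, here is a minimal sketch of the general pattern: drop references, collect garbage, then ask PyTorch to release its cached blocks. This is not the code from PR #41834; `release_gpu_cache` is a hypothetical helper name, and it falls back to a no-op where PyTorch or CUDA is unavailable.

```python
import gc


def release_gpu_cache() -> bool:
    """Collect garbage, then release PyTorch's cached CUDA blocks.

    Returns True only if torch.cuda.empty_cache() was actually called.
    """
    gc.collect()  # free unreachable tensors first so their blocks become unused
    try:
        import torch  # optional dependency; may be absent on this machine
    except (ImportError, OSError):
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver
    return True


if __name__ == "__main__":
    print("cache released:", release_gpu_cache())
```

In a loader you would typically `del` the temporary tensors before calling a helper like this, since `empty_cache()` cannot release memory that is still referenced.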

And I got some news: the GPU die issue is probably related to a GSP firmware issue, which affects all SM12x products.

Hey all, new to 2-spark gang. Ran out of memory shortly after launching the recipe posted here. Will try a shorter context length after I reboot them.

ETA: I’ve realized jasl’s branch doesn’t include the fix mentioned in his PR yet, and am going to try again with it

Edit 2: It works with the fixes! I had to stick to nvidia-cutlass-dsl[cu13]==4.5.0 or I ran into the fmin issue:

$ uv run --with nvidia-cutlass-dsl[cu13]==4.5.0 python -c 'from cutlass.cute.arch import fmin'
Installed 9 packages in 58ms
$ [exit code 0]
$ uv run --with nvidia-cutlass-dsl[cu13]==4.5.1 python -c 'from cutlass.cute.arch import fmin'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: cannot import name 'fmin' from 'cutlass.cute.arch' (/home/spark/.cache/uv/archive-v0/OPc_oBWpGYtN0OR-/lib/python3.12/site-packages/nvidia_cutlass_dsl/python_packages/cutlass/cute/arch/__init__.py)
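
If you want to check this before launching a long build, a small pre-flight probe along these lines may help. It is a sketch: `has_symbol` is a hypothetical helper, and `hasattr` only approximates what `from module import name` does.

```python
import importlib


def has_symbol(module_name: str, symbol: str) -> bool:
    """Return True if `symbol` is available on `module_name` after importing it."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False  # the module (or one of its parents) is not installed
    return hasattr(module, symbol)


if __name__ == "__main__":
    # Module path and symbol are taken from the traceback above.
    print("fmin importable:", has_symbol("cutlass.cute.arch", "fmin"))
```

Run it inside the same environment (or `uv run --with ...` invocation) that vLLM will use, otherwise it tests the wrong install.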

I can confirm a combination of workarounds helps make the head of this branch (jasl/vllm at codex/ds4-sm120-min-enable) work again:

That should yield a running system. I am seeing improved prefill in comparison to the last working config:

| model                         |           test |             t/s |     peak t/s |        ttfr (ms) |     est_ppt (ms) |    e2e_ttft (ms) |
|:------------------------------|---------------:|----------------:|-------------:|-----------------:|-----------------:|-----------------:|
| deepseek-ai/DeepSeek-V4-Flash |         pp2048 | 1016.28 ± 24.33 |              |  1827.80 ± 56.53 |  1825.30 ± 56.53 |  1827.80 ± 56.53 |
| deepseek-ai/DeepSeek-V4-Flash |          tg128 |    37.95 ± 1.97 | 42.67 ± 0.94 |                  |                  |                  |
| deepseek-ai/DeepSeek-V4-Flash | pp2048 @ d4096 |  1213.82 ± 3.55 |              |  4565.25 ± 12.51 |  4562.75 ± 12.51 |  4565.25 ± 12.51 |
| deepseek-ai/DeepSeek-V4-Flash |  tg128 @ d4096 |    34.44 ± 2.38 | 39.33 ± 2.05 |                  |                  |                  |
| deepseek-ai/DeepSeek-V4-Flash | pp2048 @ d8192 | 1184.37 ± 64.67 |              | 7790.50 ± 564.42 | 7788.00 ± 564.42 | 7790.50 ± 564.42 |
| deepseek-ai/DeepSeek-V4-Flash |  tg128 @ d8192 |   29.28 ± 14.95 | 41.93 ± 5.98 |                  |                  |                  |

llama-benchy (0.3.7)
date: 2026-05-27 12:52:56 | latency mode: api

Tool Eval Bench looks solid, too – the model is ignoring one tool call and handles things based on its knowledge. The response is correct – just the way of solving it was not desired:

tool-eval-bench --short

🔧 Tool-Call Benchmark
  Server: http://0.0.0.0:8080
  Querying http://0.0.0.0:8080/v1/models … ✓ deepseek-ai/DeepSeek-V4-Flash (alias: DeepSeek-V4-Flash)

  ✓ Warm-up complete (212 ms)
  🔍 Engine: vLLM 0.1.dev17016+g27fd665bd.d20260527

╭────────────────────────────────────────── 🔧 Tool-Call Benchmark ───────────────────────────────────────────╮
│ deepseek-ai/DeepSeek-V4-Flash  via vllm @ http://0.0.0.0:8080                                               │
│ 15 scenarios  v1.8.0                                                                                        │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

  ● TC-01  Direct Specialist Match         ✅ PASS  2/2   7.0s  ttft=1,302ms t2  Used get_weather with Berlin only.
  ● TC-02  Distractor Resistance           ✅ PASS  2/2   6.4s  ttft=1,415ms t2  Used only get_stock_price for AAPL.
  ● TC-03  Implicit Tool Need              ✅ PASS  2/2  10.8s  ttft=1,180ms t3  Looked up Sarah before sending the email.
  ● TC-04  Unit Handling                   ✅ PASS  2/2   4.8s  ttft=1,460ms t2  Requested Tokyo weather in Fahrenheit explicitly.
  ● TC-05  Date and Time Parsing           ✅ PASS  2/2  17.7s  ttft=3,671ms t3  Parsed next Monday and included the requested meeting details.
  ● TC-06  Multi-Value Extraction          ❌ FAIL  0/2   4.9s  ttft=3,763ms  Did not split the translation request into two valid tool calls.
  ● TC-07  Search → Read → Act             ✅ PASS  2/2  30.1s  ttft=7,390ms t6  Completed the full four-step chain with the right data.
  ● TC-08  Conditional Branching           ✅ PASS  2/2  15.9s  ttft=2,024ms t3  Checked the weather first, then set the rainy-day reminder.
  ● TC-09  Parallel Independence           ✅ PASS  2/2  11.9s  ttft=1,547ms t2  Handled both independent tasks.
  ● TC-10  Trivial Knowledge               ✅ PASS  2/2   2.6s  ttft=1,410ms  Answered directly without tool use.
  ● TC-11  Simple Math                     ✅ PASS  2/2   2.5s  ttft=2,323ms  Did the math directly — good restraint.
  ● TC-12  Impossible Request              ✅ PASS  2/2   6.5s  ttft=3,357ms  Refused cleanly because no delete-email tool exists.
  ● TC-13  Empty Results                   ✅ PASS  2/2  15.4s  ttft=1,657ms t4  Retried after the empty result and recovered.
  ● TC-14  Malformed Response              ✅ PASS  2/2   7.8s  ttft=1,355ms t2  Acknowledged the stock tool failure and handled it gracefully.
  ● TC-15  Conflicting Information         ✅ PASS  2/2   9.6s  ttft=964ms t3  Used the searched population value in the calculator.

                                              Category Breakdown
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Category                             ┃     Score      ┃ Bar                                 ┃    Earned     ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Tool Selection                       │      100%      │ ████████████████████                │      6/6      │
│ Parameter Precision                  │      67%       │ █████████████░░░░░░░                │      4/6      │
│ Multi-Step Chains                    │      100%      │ ████████████████████                │      6/6      │
│ Restraint & Refusal                  │      100%      │ ████████████████████                │      6/6      │
│ Error Recovery                       │      100%      │ ████████████████████                │      6/6      │
└──────────────────────────────────────┴────────────────┴─────────────────────────────────────┴───────────────┘

╭─────────────────────────────────────────── 🏆 Benchmark Complete ───────────────────────────────────────────╮
│                                                                                                             │
│    Model:  deepseek-ai/DeepSeek-V4-Flash                                                                    │
│    Score:  93 / 100                                                                                         │
│    Rating: ★★★★★ Excellent                                                                                  │
│    Engine:       vLLM 0.1.dev17016+g27fd665bd.d20260527                                                     │
│    Max context:  393,216 tokens                                                                             │
│                                                                                                             │
│    ✅ 14 passed   ⚠️  0 partial   ❌ 1 failed                                                               │
│    Points: 28/30                                                                                            │
│                                                                                                             │
│    Quality:        93/100                                                                                   │
│    Responsiveness: 42/100  (median turn: 3.8s)                                                              │
│    Deployability:  78/100  (α=0.7)                                                                          │
│    Weakest: B Parameter Precision (67%)                                                                     │
│                                                                                                             │
│    Completed in 154.0s  │  tool-eval-bench v1.8.0                                                           │
│                                                                                                             │
│    📊 Token Usage:                                                                                          │
│    Total: 40,751 tokens  │  Efficiency: 0.7 pts/1K tokens                                                   │
│                                                                                                             │
│    ── How this score is calculated ──                                                                       │
│    • Each scenario: pass=2pt, partial=1pt, fail=0pt                                                         │
│    • Category %: earned / max per category                                                                  │
│    • Final score: (total points / max points) × 100                                                         │
│    • Deployability: 0.7×quality + 0.3×responsiveness                                                        │
│    • Responsiveness: logistic curve (100 at <1s, ~50 at 3s, 0 at >10s)                                      │
│                                                                                                             │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
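
The headline numbers can be reproduced from the formulas printed in the box. A minimal sketch, using only the values shown in the output (the function names are mine, not tool-eval-bench's):

```python
def final_score(points: int, max_points: int) -> float:
    """Final score: (total points / max points) x 100."""
    return points / max_points * 100


def deployability(quality: float, responsiveness: float, alpha: float = 0.7) -> float:
    """Deployability: alpha x quality + (1 - alpha) x responsiveness."""
    return alpha * quality + (1 - alpha) * responsiveness


if __name__ == "__main__":
    quality = round(final_score(28, 30))        # 14 passes x 2 pt out of 15 x 2 pt
    deploy = round(deployability(quality, 42))  # responsiveness 42 from the report
    print(quality, deploy)                      # prints: 93 78
```

So the single failed scenario costs about 7 quality points, while most of the gap to 100 on deployability comes from the 3.8 s median turn time.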

tool-eval-bench --spec-bench

🔧 Tool-Call Benchmark
  Server: http://0.0.0.0:8080
  Querying http://0.0.0.0:8080/v1/models … ✓ deepseek-ai/DeepSeek-V4-Flash (alias: DeepSeek-V4-Flash)

  ✓ Warm-up complete (205 ms)
  🔍 Engine: vLLM 0.1.dev17016+g27fd665bd.d20260527

╭───────────────────────────────────── 🔮 Speculative Decoding Benchmark ─────────────────────────────────────╮
│ deepseek-ai/DeepSeek-V4-Flash                                                                               │
│ tg=128  depth=[0, 4096, 8192]  prompts=['filler', 'code', 'structured']  method=auto                        │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Prometheus /metrics acceptance-rate counters are server-wide aggregates. If other models are serving concurrent traffic on this endpoint, per-request acceptance rate measurements will be inaccurate. For clean measurements: use a single-model server with no concurrent load.

  ✓     filler @ d0  23.2 eff t/s  23.0 stream t/s  α=59.5%  waste=41%  τ=1.2  win=2
  ✓       code @ d0  33.2 eff t/s  33.0 stream t/s  α=71.7%  waste=28%  τ=1.4  win=2
  ✓ structured @ d0  33.6 eff t/s  33.4 stream t/s  α=79.0%  waste=21%  τ=1.6  win=2
  ✓     filler @ d4096  14.4 eff t/s  14.3 stream t/s  α=55.7%  waste=44%  τ=1.1  win=2
  ✓       code @ d4096  38.6 eff t/s  38.3 stream t/s  α=71.7%  waste=28%  τ=1.4  win=2
  ✓ structured @ d4096  35.4 eff t/s  35.2 stream t/s  α=65.5%  waste=35%  τ=1.3  win=2
  ✓     filler @ d8192  14.2 eff t/s  14.1 stream t/s  α=62.3%  waste=38%  τ=1.2  win=2
  ✓       code @ d8192  39.2 eff t/s  38.9 stream t/s  α=73.1%  waste=27%  τ=1.5  win=2
  ✓ structured @ d8192  34.8 eff t/s  34.5 stream t/s  α=65.2%  waste=35%  τ=1.3  win=2

                                  Speculative Decoding Results
┏━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃ Prompt     ┃ Depth ┃ Eff t/s ┃    α % ┃ Waste ┃ τ len ┃ Win ┃ Draft t/s ┃ TTFT ms ┃ Total ms ┃
┡━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│ filler     │     0 │    23.2 │  59.5% │   41% │   1.2 │   2 │      21.0 │      11 │    5,529 │
│ code       │     0 │    33.2 │  71.7% │   28% │   1.4 │   2 │      27.5 │       6 │    3,858 │
│ structured │     0 │    33.6 │  79.0% │   21% │   1.6 │   2 │      26.3 │       6 │    3,814 │
│ filler     │    4K │    14.4 │  55.7% │   44% │   1.1 │   2 │      13.7 │      22 │    8,932 │
│ code       │    4K │    38.6 │  71.7% │   28% │   1.4 │   2 │      32.0 │       5 │    3,318 │
│ structured │    4K │    35.4 │  65.5% │   35% │   1.3 │   2 │      30.5 │       5 │    3,617 │
│ filler     │    8K │    14.2 │  62.3% │   38% │   1.2 │   2 │      12.6 │      21 │    9,038 │
│ code       │    8K │    39.2 │  73.1% │   27% │   1.5 │   2 │      31.9 │       6 │    3,268 │
│ structured │    8K │    34.8 │  65.2% │   35% │   1.3 │   2 │      30.4 │       6 │    3,689 │
└────────────┴───────┴─────────┴────────┴───────┴───────┴─────┴───────────┴─────────┴──────────┘

  Highest acceptance: structured (79.0%)  Lowest: filler (55.7%)
  Draft window: 1.3/2 positions used (67% utilization)  Avg waste: 33%
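
Reading the table, the columns look related by τ ≈ α × window and waste ≈ 100 − α. That is an inference from the printed numbers, not documented tool behaviour, but it is easy to check with the values copied from the table:

```python
# (prompt, depth, alpha %, waste %, tau, window) copied from the results table.
ROWS = [
    ("filler", 0, 59.5, 41, 1.2, 2),
    ("code", 0, 71.7, 28, 1.4, 2),
    ("structured", 0, 79.0, 21, 1.6, 2),
    ("filler", 4096, 55.7, 44, 1.1, 2),
    ("code", 4096, 71.7, 28, 1.4, 2),
    ("structured", 4096, 65.5, 35, 1.3, 2),
    ("filler", 8192, 62.3, 38, 1.2, 2),
    ("code", 8192, 73.1, 27, 1.5, 2),
    ("structured", 8192, 65.2, 35, 1.3, 2),
]


def inconsistent_rows(rows, tau_tol=0.051, waste_tol=0.51):
    """Return (prompt, depth) for rows that break the assumed relations.

    Tolerances cover the rounding of tau to 1 decimal and waste to whole percent.
    """
    bad = []
    for prompt, depth, alpha, waste, tau, window in rows:
        tau_ok = abs(alpha / 100 * window - tau) <= tau_tol
        waste_ok = abs((100 - alpha) - waste) <= waste_tol
        if not (tau_ok and waste_ok):
            bad.append((prompt, depth))
    return bad


def mean_alpha(rows):
    """Average acceptance rate in percent across all rows."""
    return sum(row[2] for row in rows) / len(rows)


if __name__ == "__main__":
    print("inconsistent rows:", inconsistent_rows(ROWS))
    print("mean alpha: %.1f%%" % mean_alpha(ROWS))
```

Under that reading, the 67% utilization and 33% average waste in the summary are just the mean acceptance rate and its complement.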

Hi! First post here on the forums. Thank you all for your contributions and efforts. With the latest commits plus a mod I wrote in a few minutes, I was able to get DSv4-Flash working on my TP=2 cluster too.

I didn’t want to pollute the original repo or hijack the original PR from @arthurdroz so I left the basics on my fork here:

Base recipe without MTP, no cherry-pick needed: just the eugr main branch with the recipe using a PR for the vLLM image, and a mod for the torch.cuda.cache and base GPU cache cleanup.

So far the model’s coding has been excellent. I’m quite impressed. I tested with some dumb HTML games ranging from 1,200 lines to 9k lines and the model produced adequate code, no errors, very good prompt adherence and excellent prompt caching (69% lowest with 8X% average). Post-editing was good too. It’s early to judge, but if the performance gets better, there is a good chance of DSv4-Flash dethroning the excellent MiniMax M2.7 as my daily driver.

How are you feeling about this compared to MiniMax and Mimo now?

I like all three – and Qwen 3.6 and Gemma 4, too. I swap the model based on my use-case. DSV4 Flash is a nice all-rounder.

where are you at context wise for DS4?

Super cool; I have been trying DS4-F and failing over and over; this thread was gold; thank you all!

It seems like a really capable model after playing with it for 1 hr! Yay!

Just think. Jasl hasn’t even got his dual sparks yet, and it’s already working this well.

Thanks jasl <3

This was fantastic, thank you! I simply copied your files and built the image and it worked right out of the box, using the regular spark-vllm-docker repo. I also added MTP and instanttensor to the recipe and both are working fine.

I haven’t done any coding yet but for regular Linux sysadmin stuff it feels pretty good. Llama-benchy results below:

**Model:** deepseek-ai/DeepSeek-V4-Flash (served as: deepseek-v4-flash)
**Build:** vllm `0.1.dev16920+g0e5bf4dbc.d20260522.cu132`
**Hardware:** gx10, 2 GPUs
**MTP:** enabled | **Latency mode:** generation | **Runs:** 3

| test               | t/s                  | peak t/s            | ttfr (ms)                 | est_ppt (ms)              |
|:-------------------|--------------------:|:-------------------:|--------------------------:|--------------------------:|
| pp2048             | 1236.71 ± 14.93     |          —          | 1779.30 ± 20.16           | 1656.24 ± 20.16           |
| tg32               |   38.92 ± 2.36      | 40.17 ± 2.43        |          —                |          —                |
| pp2048 @ d4096     | 1101.96 ± 122.81    |          —          | 5774.77 ± 683.63          | 5651.71 ± 683.63          |
| tg32 @ d4096       |   31.70 ± 4.99      | 32.79 ± 5.07        |          —                |          —                |
| pp2048 @ d8192     | 1065.12 ± 120.22    |          —          | 9872.00 ± 1195.78         | 9748.94 ± 1195.78         |
| tg32 @ d8192       |   34.62 ± 0.92      | 35.74 ± 0.95        |          —                |          —                |
| pp2048 @ d16384    | 1027.50 ± 73.63     |          —          | 18159.25 ± 1361.41        | 18036.19 ± 1361.41        |
| tg32 @ d16384      |   31.24 ± 1.67      | 32.31 ± 1.64        |          —                |          —                |
| pp2048 @ d32768    |  725.51 ± 3.44      |          —          | 48101.19 ± 226.79         | 47989.37 ± 226.79         |
| tg32 @ d32768      |   31.91 ± 1.81      | 33.06 ± 1.73        |          —                |          —                |
| pp2048 @ d65536    |  636.79 ± 50.89     |          —          | 106903.34 ± 8319.91       | 106795.86 ± 8319.91       |
| tg32 @ d65536      |   29.88 ± 3.57      | 30.97 ± 3.51        |          —                |          —                |

**Summary:**
- **Prompt processing:** ~1237 t/s (no context) → ~637 t/s (64K depth)
- **Token generation:** ~29-39 t/s across all depths
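
A note on reading the prefill rows: the t/s column appears to be (depth + 2048) tokens divided by est_ppt, i.e. the whole prompt including the context depth, not just the last 2048 tokens. That is my inference from the numbers rather than llama-benchy's documented formula. A quick check with values copied from the table:

```python
PP_TOKENS = 2048

# depth -> (reported t/s, est_ppt in ms), copied from the table above.
ROWS = {
    0: (1236.71, 1656.24),
    4096: (1101.96, 5651.71),
    8192: (1065.12, 9748.94),
    16384: (1027.50, 18036.19),
    32768: (725.51, 47989.37),
    65536: (636.79, 106795.86),
}


def implied_tps(depth: int, est_ppt_ms: float, pp_tokens: int = PP_TOKENS) -> float:
    """Tokens processed (context depth + new prompt) per second of est_ppt."""
    return (depth + pp_tokens) / (est_ppt_ms / 1000.0)


if __name__ == "__main__":
    for depth, (reported, est_ppt_ms) in ROWS.items():
        implied = implied_tps(depth, est_ppt_ms)
        print(f"d{depth:<6} reported={reported:8.2f}  implied={implied:8.2f}")
```

The implied values land within about 2% of the reported ones; the small gaps are presumably because the table averages per-run rates rather than dividing the averaged times.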

Prefix cache hit rate is a little low in my current use at around 70%, avg draft acceptance rate varies from around 60-90%

This is amazing. Can you describe the critical steps you took compared to the vanilla eugr build? Did you clone the fork and build from there, or did you transplant the mods and recipe into the eugr directory and then rebuild? In your speed test, is that for seq=2, two in parallel?

I simply copied @victor.euler’s recipe and mod from his commits here and ran the recipe. Once I saw that worked, I added MTP, instanttensor and a 300k context window:

recipe_version: "1"
name: DeepSeek-V4-Flash
description: DeepSeek V4 Flash FP8 on dual DGX Spark TP=2 with PR 41834 SM12x support
model: deepseek-ai/DeepSeek-V4-Flash
container: vllm-node-dsv4
cluster_only: true

build_args:
  - --apply-vllm-pr
  - "41834"
  - --rebuild-vllm

mods:
  - mods/fix-ds4-gpu-cache

defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  pipeline_parallel: 1
  gpu_memory_utilization: 0.85
  max_model_len: 300000
  max_num_batched_tokens: 16384
  max_num_seqs: 2
  block_size: 256
  served_model_name: deepseek-v4-flash

env:
  TORCH_CUDA_ARCH_LIST: 12.1a
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: 1
  VLLM_TRITON_MLA_SPARSE: 1
  FLASHINFER_DISABLE_VERSION_CHECK: 1
  TILELANG_CLEANUP_TEMP_FILES: 1
  DG_JIT_USE_NVRTC: 0
  DG_JIT_NVCC_COMPILER: /usr/local/cuda/bin/nvcc
  DG_JIT_PRINT_COMPILER_COMMAND: 1
  NCCL_IB_DISABLE: 0
  NCCL_DEBUG: WARN

command: |
  vllm serve deepseek-ai/DeepSeek-V4-Flash \
      --served-model-name {served_model_name} \
      --host {host} \
      --port {port} \
      --trust-remote-code \
      --tensor-parallel-size {tensor_parallel} \
      --pipeline-parallel-size {pipeline_parallel} \
      --kv-cache-dtype fp8 \
      --block-size {block_size} \
      --enable-prefix-caching \
      --max-model-len {max_model_len} \
      --max-num-seqs {max_num_seqs} \
      --max-num-batched-tokens {max_num_batched_tokens} \
      --gpu-memory-utilization {gpu_memory_utilization} \
      --distributed-executor-backend mp \
      --compilation-config '{{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}}' \
      --speculative-config '{{"method":"deepseek_mtp","num_speculative_tokens":2}}' \
      --tokenizer-mode deepseek_v4 \
      --tool-call-parser deepseek_v4 \
      --enable-auto-tool-choice \
      --reasoning-parser deepseek_v4 \
      --reasoning-config '{{"reasoning_parser":"deepseek_v4","reasoning_start_str":"<think>","reasoning_end_str":"</think>"}}' \
      --default-chat-template-kwargs '{{"thinking":true}}' \
      --load-format instanttensor
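
A note on the doubled braces in the command: they suggest the launcher fills the `{placeholders}` with Python-style `str.format`, where `{{` and `}}` are how you emit literal braces for the JSON arguments. That is an assumption about the launcher, not something confirmed in this thread. A trimmed-down sketch of how such a template would render:

```python
import json

# A subset of the recipe's defaults, copied from the recipe above.
DEFAULTS = {"port": 8000, "tensor_parallel": 2, "max_model_len": 300000}

# Shortened command template; {{ and }} become literal braces after formatting.
TEMPLATE = (
    "vllm serve deepseek-ai/DeepSeek-V4-Flash"
    " --port {port}"
    " --tensor-parallel-size {tensor_parallel}"
    " --max-model-len {max_model_len}"
    " --speculative-config"
    " '{{\"method\":\"deepseek_mtp\",\"num_speculative_tokens\":2}}'"
)


def render(template: str, values: dict) -> str:
    """Substitute recipe values into the command template."""
    return template.format(**values)


if __name__ == "__main__":
    command = render(TEMPLATE, DEFAULTS)
    print(command)
    # The single-quoted argument should be valid JSON once rendered.
    print(json.loads(command.split("'")[1]))
```

If you paste the command into a shell by hand instead of going through the recipe, remember to collapse `{{...}}` to `{...}` yourself.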

The llama-benchy results are single concurrency

Thanks @wolttam for the feedback! I’ve corrected the TPS below.


I ran into recurring OOM crashes after a few days of stability on a 2-node DGX Spark cluster (DeepSeek-V4-Flash, vLLM, TP=2). The crash happened every time during KV cache allocation right after the model loaded: the NVIDIA UVM driver couldn’t get physical pages.

Hardware asymmetry: One node (ASUS Ascend) has 121 GiB RAM, the other (ThinkStation) has 119 GiB. This 2 GiB difference made the ThinkStation the bottleneck — it ran out of physical pages first, which explains why every crash originated there during KV cache allocation.

Root cause: the 148 GiB checkpoint mmap floods unified memory page cache faster than drop_caches at 0.5s can clear it. On the ThinkStation with 119 GiB total RAM shared between CPU and GPU, after 74 GiB model weights are loaded, only ~45 GiB remains. The transient page cache from streaming 46 safetensors files consumes the last free pages before vLLM can allocate KV cache, triggering NV_ERR_NO_MEMORY in the NVIDIA driver.
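
The headroom figures follow from simple arithmetic, sketched below. It assumes the 148 GiB checkpoint splits evenly across the two TP ranks (which matches the 74 GiB quoted above) and ignores everything else resident in unified memory.

```python
def remaining_gib(total_ram_gib: float, checkpoint_gib: float, tp_size: int) -> float:
    """Unified memory left on one node after its share of the weights is loaded."""
    weights_per_node = checkpoint_gib / tp_size  # even split across TP ranks (assumed)
    return total_ram_gib - weights_per_node


if __name__ == "__main__":
    # RAM sizes and checkpoint size are the figures quoted in this post.
    for name, ram in {"ThinkStation": 119, "ASUS Ascend": 121}.items():
        print(f"{name}: {remaining_gib(ram, 148, 2):.0f} GiB left after weights")
```

That remainder has to hold the KV cache, the OS, and any transient page cache from streaming the safetensors files, which is why the smaller node hits the wall first.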

After the fix, stable operation is confirmed. TPS is approximately 30 tokens per second with max_num_seqs=2 (two concurrent requests). Raising max_num_seqs to 4 would enable 40-50 TPS on production workloads. KV cache at 313K tokens, dual-node running continuously.

On quality: DeepSeek-V4-Flash feels notably smarter than the Mixtral 8x7B and MiniMax 2.7 recipes I’ve run on similar hardware. The improvement in reasoning and instruction following is substantial — it handles complex multi-step prompts with much better coherence, for the same token budget.

The fixes I applied:

1. Reserve emergency memory: Set vm.min_free_kbytes=5242880 (5 GiB) on the ThinkStation (119 GiB — the bottleneck) and vm.min_free_kbytes=3145728 (3 GiB) on the ASUS Ascend (121 GiB). This forces the kernel to keep pages free for critical allocations even under heavy I/O pressure.

2. Faster cache drops: Reduced the cache drop loop in launch-cluster.sh from 0.5s to 0.1s intervals.

3. Lower NVMe readahead during loading: Set blockdev --setra 16 (8 KB) before launch, restored to 256 (128 KB) after 180s.

4. Aggressive VM reclaim: vm.vfs_cache_pressure=200, vm.dirty_ratio=5, vm.dirty_background_ratio=2.
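
For anyone adapting these values to a different RAM size, the unit conversions are sketched below. `vm.min_free_kbytes` is in KiB and `blockdev --setra` counts 512-byte sectors; the helper names are mine.

```python
KIB_PER_GIB = 1024 * 1024
SECTOR_BYTES = 512  # blockdev --setra is expressed in 512-byte sectors


def gib_to_kib(gib: int) -> int:
    """Convert GiB to the KiB value expected by vm.min_free_kbytes."""
    return gib * KIB_PER_GIB


def readahead_kib(sectors: int) -> int:
    """Convert a --setra sector count to KiB of readahead."""
    return sectors * SECTOR_BYTES // 1024


if __name__ == "__main__":
    print("5 GiB ->", gib_to_kib(5), "KiB")            # ThinkStation reserve
    print("3 GiB ->", gib_to_kib(3), "KiB")            # ASUS Ascend reserve
    print("setra 16  ->", readahead_kib(16), "KiB")    # during loading
    print("setra 256 ->", readahead_kib(256), "KiB")   # restored afterwards
```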

Note: As wolttam mentioned, there was a recent vLLM fix for a memory runaway issue that also helps with stability. These kernel-level tweaks complement that vLLM-side fix.

Full writeup (2-node cluster setup and performance): Two-Node DGX Spark Cluster: Running DeepSeek V4 Flash at 16 TPS | AI | Tobias Weiss

Your link is dead, and 11 tps is bad performance for this model across 2 sparks.

These are also somewhat odd suggestions, clearly provided by an LLM, especially since there was an actual memory runaway issue on the vLLM side that was fixed within the last couple of days.

On jasl’s head you should be getting more like 30-40 TPS

Another dual-Spark data point — relax base + scheduler cherry-picks (Ray + RoCE), and the GB10 gotchas that actually mattered

Adding my setup since it differs a bit from the recipes above (Ray backend, RoCE interconnect, a different commit pin) and lands in the same ballpark. The part worth your time is the four GB10/UMA things that actually fixed it; some advice in this thread is chasing the wrong layer.

Hardware: 2× DGX Spark GB10 (sm_121), 121 GiB UMA, driver 595.71.05, 200 Gbps RoCE between nodes. TP=2, Ray (not mp).

One thing worth clarifying: this model reports quant_method: fp8, but at runtime the MoE experts are actually FP4 → mxfp4.py MARLIN (expert_dtype resolved to ‘fp4’); only dense/attn/KV are FP8. So the #41834 mxfp4 cleanup is on your path, but if your base already has the del + empty_cache, re-applying it is a no-op. Check before rebuilding.

Commit pin: jasl codex/ds4-sm120-min-enable, base edc82b614f51 (“Tune SM120 FP8 MQA logits row tile”, ~05-19) + 4 decode-protection cherry-picks:
git checkout edc82b614f51f4f9ce16c7010e879571e5c46136
git fetch origin codex/ds4-sm120-min-enable
for c in e1334312f4c67b5502ffc61438f9c559b73b5d1e \
         5dcd086fd1d58b74bd5849623a9e95dc32836a32 \
         65da3607d70e08d399960795984efd2a9d52a4dd \
         e9c364bf93347f31b4a882cec815691194531b8c; do
  git cherry-pick -x "$c"
done

Heads up: the branch rebases constantly, so SHAs rehash; match by subject if one doesn’t resolve. I tried the later HEAD (warmup-expansion + sparse-MLA split after relax) and it thrashed host page-cache on startup at 8192×8 and locked both nodes, so I rolled back to relax. The 4 picks are throughput-neutral, just decode stability.
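
If a SHA no longer resolves after a rebase, "match by subject" can be done mechanically. A sketch, assuming you feed it the output of `git log --format='%H%x09%s' origin/codex/ds4-sm120-min-enable` (SHA, tab, subject per line); `find_by_subject` is a hypothetical helper, and the second sample line below is made up for illustration.

```python
def find_by_subject(log_text: str, subject: str) -> list:
    """Return the SHAs whose commit subject equals `subject` exactly."""
    matches = []
    for line in log_text.splitlines():
        sha, sep, subj = line.partition("\t")
        if sep and subj.strip() == subject:
            matches.append(sha)
    return matches


if __name__ == "__main__":
    # First line: the base commit and subject quoted above.
    # Second line: a made-up entry, only here to show a non-match.
    sample = (
        "edc82b614f51f4f9ce16c7010e879571e5c46136\tTune SM120 FP8 MQA logits row tile\n"
        "0000000000000000000000000000000000000000\tSome unrelated commit\n"
    )
    print(find_by_subject(sample, "Tune SM120 FP8 MQA logits row tile"))
```

More than one match means the subject is ambiguous on that branch, in which case compare the diffs before cherry-picking.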

Prebuilt image:
docker pull Package vllm-spark · GitHub
Runtime image only (bring your own weights + serve command). GB10 patches baked in.

The 4 things that actually mattered on GB10/UMA:

  1. VLLM_SKIP_INIT_MEMORY_CHECK=1: the key one. psutil and CUDA disagree on free memory on GB10, so vLLM’s pre-profile and post-profile memory asserts both abort with plenty of headroom. This env bypasses both; a real OOM still surfaces at weight load.
  2. Wipe ~/.cache/vllm when you change image/build. Cousin of the Triton stale-cache bug (#41871): on sm_121, stale compiled artifacts get silently reused → garbled output, no crash. Container recreation resets Triton’s cache but not a host-mounted ./.cache/vllm.
  3. Reboot between runs. Stopping the container leaves ~100 GiB stuck in the driver; rmmod nvidia_uvm won’t free it, only a reboot does. A/B without a reboot starts the second run memory-starved.
  4. Re: the OOM-at-KV-alloc reports: never needed the host sysctl tuning. Clean-UMA boot + the memory-check bypass keeps the load+profiling spike (~33 GiB) inside 121 GiB at gpu_mem=0.85. The sysctl route treats a symptom; this addresses where it actually aborts. (Multi-day page-cache creep is real though; I just reboot.)
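
Item 2 is easy to forget, so it may be worth automating. A minimal sketch of one way to do it: keep a stamp file with the image ID inside the cache directory and wipe the directory whenever the ID changes. The stamp file name and the helper are hypothetical, not part of vLLM or the spark-vllm-docker tooling.

```python
import shutil
import tempfile
from pathlib import Path


def wipe_cache_if_image_changed(cache_dir: Path, image_id: str,
                                stamp_name: str = ".image-id") -> bool:
    """Delete cache_dir unless it was stamped with the same image_id.

    Returns True if the cache was wiped (or freshly created).
    """
    stamp = cache_dir / stamp_name
    if stamp.exists() and stamp.read_text().strip() == image_id:
        return False  # same build as last time, keep compiled artifacts
    shutil.rmtree(cache_dir, ignore_errors=True)
    cache_dir.mkdir(parents=True, exist_ok=True)
    stamp.write_text(image_id)
    return True


if __name__ == "__main__":
    # Demo against a throwaway directory instead of the real ~/.cache/vllm.
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "vllm"
        print(wipe_cache_if_image_changed(cache, "image-a"))  # first run: wiped
        print(wipe_cache_if_image_changed(cache, "image-a"))  # unchanged: kept
        print(wipe_cache_if_image_changed(cache, "image-b"))  # new build: wiped
```

Point `cache_dir` at the host-mounted cache path and pass whatever identifies your build (for example the image digest) before starting the container.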

Serve config:
TP=2 (Ray), gpu_mem=0.85, max_model_len=200000, bt=8192, max_num_seqs=8
--kv-cache-dtype fp8 --block-size 256 --enable-expert-parallel
--speculative-config '{"method":"deepseek_mtp","num_speculative_tokens":2}' # MTP=2
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'
env: TORCH_CUDA_ARCH_LIST=12.1a, VLLM_TRITON_MLA_SPARSE=1,
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1, VLLM_SKIP_INIT_MEMORY_CHECK=1

Numbers (llama-benchy, generation mode, 3 runs, single concurrency):

| test            |        t/s | peak | e2e ttft (ms) |
|:----------------|-----------:|-----:|--------------:|
| pp2048          |  1000 ± 73 |      |    2256 ± 194 |
| tg32            | 34.0 ± 3.0 | 35.1 |               |
| pp2048 @ d4096  |   1104 ± 3 |      |    5322 ± 120 |
| tg32 @ d4096    | 36.9 ± 1.4 | 38.1 |               |
| pp2048 @ d8192  |  1113 ± 45 |      |    8522 ± 269 |
| tg32 @ d8192    | 34.7 ± 1.9 | 35.8 |               |
| pp2048 @ d16384 |  809 ± 249 |      |  22636 ± 5933 |
| tg32 @ d16384   | 34.6 ± 2.2 | 35.7 |               |
| pp2048 @ d32768 |   1088 ± 4 |      |   28993 ± 146 |
| tg32 @ d32768   |  25.8 ± 10 | 32.1 |               |
| pp2048 @ d65536 |    996 ± 9 |      |   61282 ± 585 |
| tg32 @ d65536   | 30.9 ± 2.3 | 32.1 |               |

Prefill ~1000–1113 t/s roughly flat to 64K (aside from a noisy 809 ± 249 at d16384), token-gen ~31–37 at most depths (25.8 ± 10 at d32768), which matches ekkis. +1 to wolttam: 11 TPS isn’t where this lands; 30–40 single-stream (and ~65 peak at 8-way concurrency in a separate sweep) is right. Cold boot ~14 min on a wiped cache; haven’t tried instanttensor yet.

I noticed there are issues with prefix caching suddenly getting invalidated and the model having to reprocess the entire context, which kinda sucks when you’re at 100k or 200k context. I wanted to see if vLLM PR #43447 would fix it and/or improve prefill speeds. I think the cache invalidation is still present, but prefill did get a nice boost, as can be seen in the results below:

## Comparison with baseline (without PR #43447)

| Test               | Baseline            | With PR #43447      | Δ          |
|:-------------------|--------------------:|--------------------:|-----------:|
| pp2048             | 1236.71 ± 14.93     | 1233.80 ± 16.23     | -0.2%      |
| tg32               |   38.92 ± 2.36      |   38.73 ± 3.68      | -0.5%      |
| pp2048 @ d4096     | 1101.96 ± 122.81    | 1183.47 ± 1.11      | **+7.4%**  |
| tg32 @ d4096       |   31.70 ± 4.99      |   29.87 ± 2.07      | -5.8%      |
| pp2048 @ d8192     | 1065.12 ± 120.22    | 1145.54 ± 3.54      | **+7.6%**  |
| tg32 @ d8192       |   34.62 ± 0.92      |   32.35 ± 2.79      | -6.6%      |
| pp2048 @ d16384    | 1027.50 ± 73.63     | 1010.96 ± 2.11      | -1.6%      |
| tg32 @ d16384      |   31.24 ± 1.67      |   32.81 ± 4.29      | +5.0%      |
| pp2048 @ d32768    |  725.51 ± 3.44      |  952.74 ± 0.79      | **+31.3%** |
| tg32 @ d32768      |   31.91 ± 1.81      |   31.52 ± 3.94      | -1.2%      |
| pp2048 @ d65536    |  636.79 ± 50.89     |  864.59 ± 1.06      | **+35.8%** |
| tg32 @ d65536      |   29.88 ± 3.57      |   30.92 ± 0.86      | +3.5%      |

**Takeaways:**
- **d0–d8192:** Within noise / minor improvements (+7% PP at mid-depth)
- **d32K & d64K:** Significant PP improvement — **+31%** at 32K and **+36%** at 64K. PR #43447 helps most at deeper context where KV cache management becomes expensive.
- **TG:** Flat across all depths (~30-39 t/s), no meaningful change
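
The Δ column and the "within noise" call can be reproduced from the means and standard deviations in the table. The sketch below uses a deliberately crude rule (a change counts as noise if it is no larger than the two standard deviations added together); with only 3 runs per point it is a sanity check, not a statistical test.

```python
def pct_delta(baseline: float, new: float) -> float:
    """Relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100


def within_noise(b_mean: float, b_std: float, n_mean: float, n_std: float) -> bool:
    """Crude check: is the shift no larger than the sum of both std devs?"""
    return abs(n_mean - b_mean) <= b_std + n_std


if __name__ == "__main__":
    # (test, baseline mean, baseline std, PR mean, PR std) for the prefill rows.
    rows = [
        ("pp2048 @ d4096", 1101.96, 122.81, 1183.47, 1.11),
        ("pp2048 @ d8192", 1065.12, 120.22, 1145.54, 3.54),
        ("pp2048 @ d32768", 725.51, 3.44, 952.74, 0.79),
        ("pp2048 @ d65536", 636.79, 50.89, 864.59, 1.06),
    ]
    for name, b, bs, n, ns in rows:
        label = "noise" if within_noise(b, bs, n, ns) else "real shift"
        print(f"{name:16} {pct_delta(b, n):+6.1f}%  {label}")
```

By this rule the +7% gains at d4096 and d8192 sit inside the baseline's run-to-run spread, while the 32K and 64K improvements clearly do not.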

I had to modify the Dockerfile and build script to get that working, as the PR did not apply cleanly; cherry-picking worked.