Deepseek V4 released

looks like dual sparks

Transformers 5.8 is out with DeepSeek-V4 support

I don’t think so. It’s really more about the actual implementation of model/kernels for Deepseek v4.

I’m sure at some point NVIDIA will get us updated… but I don’t think that’s the problem here. CUDA 13 major features are all we need – CUDA 13.2 basically brings CUDA-native tile-based kernels/programming which they’re going to have to bring to Spark… but that stuff is early, so we’re not missing out (yet).

I’m curious how much context we could realistically run with two Sparks if we really pushed them. One of the main selling points of DeepSeek V4 is that they found a way to support much larger context windows while using significantly less VRAM. I’d be interested to see how far we could scale context efficiency with a dual Spark setup.

At this rate in six months, who knows maybe we can actually run full context on a 1M context on a dual spark cluster

(EngineCore pid=114) INFO 05-06 03:31:08 [kv_cache_utils.py:1710] GPU KV cache size: 4,627,680 tokens
(EngineCore pid=114) INFO 05-06 03:31:08 [kv_cache_utils.py:1711] Maximum concurrency for 262,144 tokens per request: 17.65x

Hmmm, wait a minute (or several)

[kv_cache_utils.py:1710] GPU KV cache size: 6,463,665 tokens
[kv_cache_utils.py:1711] Maximum concurrency for 1,000,000 tokens per request: 6.46x

1M TOKENS CONTEXT

IT’S RUNNING

That’s awesome, deep seek V4 flash is not as smart as minimax 2.5 or 2.7 but I’m definitely willing to lose intelligence for one 1M context. When you get time, could you post your recipe on Spark Arena?

It’s a tough recipe with the custom vllm build with PR and everything, I don’t think it’s ready yet, but you can try to reproduce it. It’s just the same recipe as above with 1000000 max model length.

I’ll be trying out this version in the next few days ; I’ve been using DS4-Flash for the last two days through their API (so cheap) to see what it can do and it’s actually quite smart.

That’s totally fair. Great job though it’s people like you in the community that just push this hardware to its limits it’s so exciting.

In my test,4 nodes with crs804 can run raw model(v4 flash) in 1m context with 30tokens/s speed in short requests and 15 tokens/s speed in long context.It can use max mode which has 384k output length.

1m context is not generally good as you expect it:

imma try t he recipe when you say its not ready what are issues you are ssing

you can pass build args in the recipe itself: spark-vllm-docker/recipes at main Ā· eugr/spark-vllm-docker Ā· GitHub

Got it working doing this.. 20 t/s

**DeepSeek V4 Flash W4A16-FP8 — Dual DGX Spark TP=2 — 1M Context — VERIFIED WORKING**

Built via [eugr/spark-vllm-docker]( GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks Ā· GitHub ) + 2 patches from [pasta-paul/dsv4-flash-w4a16-fp8]( GitHub - pasta-paul/dsv4-flash-w4a16-fp8: DeepSeek-V4-Flash W4A16-FP8 quantization on 8x H200 — patches, recipe, mission report Ā· GitHub ). Endpoint stable at 1M context with 85% memory utilization.

**Hardware:** 2Ɨ DGX Spark GB10 (SM 12.1a, 121 GiB UMA each), QSFP56 200G interconnect at 169.254.x.x.

**Build:**

```bash

cd ~/spark-vllm-docker

./build-and-copy.sh \

--apply-vllm-pr 40991 \

--apply-vllm-pr 41276 \

--rebuild-vllm \

-t vllm-node-dsv4

```

**Apply patches inside the resulting image:**

```bash

docker run --name dsv4-patcher \

-v ~/dsv4-flash-w4a16-fp8/scripts/patch_v4_packed_mapping.py:/tmp/p1.py:ro \

vllm-node-dsv4:latest \

bash -c ā€˜DSV4=$(python3 -c ā€œimport vllm.model_executor.models.deepseek_v4 as m; print(m._file_)ā€ 2>/dev/null | tail -1); python3 /tmp/p1.py ā€œ$DSV4ā€ā€™

docker commit dsv4-patcher vllm-node-dsv4:latest

docker rm dsv4-patcher

```

(`patch_workspace_prereserve.py` will fail on this vllm build because its target anchor moved — that’s OK, just use `–enforce-eager` in the serve command instead.)

**Recipe `recipes/deepseek-v4-flash.yaml`:**

```yaml

recipe_version: ā€œ1ā€

name: DeepSeek-V4-Flash-W4A16

model: pastapaul/DeepSeek-V4-Flash-W4A16-FP8

container: vllm-node-dsv4

cluster_only: true

mods: []

defaults:

port: 8000

host: 0.0.0.0

tensor_parallel: 2

gpu_memory_utilization: 0.85

max_model_len: 1048576

env:

TORCH_CUDA_ARCH_LIST: ā€œ12.1aā€

VLLM_ALLOW_LONG_MAX_MODEL_LEN: ā€œ1ā€

command: |

vllm serve pastapaul/DeepSeek-V4-Flash-W4A16-FP8 \

  --served-model-name deepseek-v4-flash \\

  --trust-remote-code \\

  --kv-cache-dtype fp8 \\

  --block-size 256 \\

  --tokenizer-mode deepseek_v4 \\

  --tool-call-parser deepseek_v4 \\

  --enable-auto-tool-choice \\

  --reasoning-parser deepseek_v4 \\

  --max-model-len {max_model_len} \\

  --max-num-seqs 4 \\

  --max-num-batched-tokens 8192 \\

  --gpu-memory-utilization {gpu_memory_utilization} \\

  --host {host} \\

  --port {port} \\

  -tp {tensor_parallel} \\

  --enforce-eager \\

  --distributed-executor-backend ray

```

**Run:**

```bash

./run-recipe.sh recipes/deepseek-v4-flash.yaml

```

**Notes:**

- `gpu_memory_utilization: 0.85` keeps system memory at 91-92% (`0.92` pushes to 99% Critical on Netdata)

- `–enforce-eager` is required without the workspace prereservation patch (~4Ɨ decode penalty vs cudagraphs, but stable)

- `max_model_len: 1048576` confirmed working at 1M with `num_gpu_blocks: 26,091` Ɨ `block_size: 256` = **6.68M token KV pool** = ~6.4Ɨ concurrent 1M-token requests

- Decode steady-state: ~14-17 t/s with `–enforce-eager`, ~21 t/s with cudagraphs (when patch can apply)

- KV cache stable, 0 preemptions in initial smoke testing

- Model loads ~143 GiB total weights (~50 GB Ɨ 3 large shards + 1 small), ~73 GiB resident per rank after TP split

**Pre-reqs that must be on the image (the `–apply-vllm-pr` flags handle most):**

- jasl/vllm + PR 40991 + PR 41276

- transformers ≄ 5.8.0 (released)

- compressed-tensors 0.15.1a20260428 (the prerelease — newer 20260503 build expects a `scale_fmt` field that pastapaul’s quant doesn’t carry)

- PyTorch 2.11.0+cu130, FlashInfer 0.6.9, Triton 3.6.0

- TORCH_CUDA_ARCH_LIST=12.1a build flag

Can confirm, long prefill also causes concurrency > 1 to collapse t/s to almost 0

I saw on ModelScope that they already have four versions released.

Do you have any idea what % Q4 will lose exactly from the total capacity?

Do you have any idea when Q4 will be available?

I have a system that makes latency extremely low even on 16GB.

Not sure what you mean by ā€˜capacity’, but the smallest model, V4 Flash, which is 160GB, is already mix quantized at mostly 4-bit by Deepseek.

Any other 4-bit quants you find will not be smaller than what Deepseek already provided you. There’s nothing better you can wait for.

Out of the models that can be loaded onto a dual-Spark setup (MiniMax M2.7, DeepSeek-V3-Flash, MIMO V2.5), DeepSeek handles long context better than the others.

We are really looking forward to seeing this model fully optimized and working smoothly.

Mimo needs 4. Tp 2 doesn’t work