looks like dual sparks
Transformers 5.8 is out with DeepSeek-V4 support
I donāt think so. Itās really more about the actual implementation of model/kernels for Deepseek v4.
Iām sure at some point NVIDIA will get us updated⦠but I donāt think thatās the problem here. CUDA 13 major features are all we need ā CUDA 13.2 basically brings CUDA-native tile-based kernels/programming which theyāre going to have to bring to Spark⦠but that stuff is early, so weāre not missing out (yet).
Iām curious how much context we could realistically run with two Sparks if we really pushed them. One of the main selling points of DeepSeek V4 is that they found a way to support much larger context windows while using significantly less VRAM. Iād be interested to see how far we could scale context efficiency with a dual Spark setup.
At this rate in six months, who knows maybe we can actually run full context on a 1M context on a dual spark cluster
(EngineCore pid=114) INFO 05-06 03:31:08 [kv_cache_utils.py:1710] GPU KV cache size: 4,627,680 tokens
(EngineCore pid=114) INFO 05-06 03:31:08 [kv_cache_utils.py:1711] Maximum concurrency for 262,144 tokens per request: 17.65x
Hmmm, wait a minute (or several)
[kv_cache_utils.py:1710] GPU KV cache size: 6,463,665 tokens
[kv_cache_utils.py:1711] Maximum concurrency for 1,000,000 tokens per request: 6.46x
1M TOKENS CONTEXT
ITāS RUNNING
Thatās awesome, deep seek V4 flash is not as smart as minimax 2.5 or 2.7 but Iām definitely willing to lose intelligence for one 1M context. When you get time, could you post your recipe on Spark Arena?
Itās a tough recipe with the custom vllm build with PR and everything, I donāt think itās ready yet, but you can try to reproduce it. Itās just the same recipe as above with 1000000 max model length.
Iāll be trying out this version in the next few days ; Iāve been using DS4-Flash for the last two days through their API (so cheap) to see what it can do and itās actually quite smart.
Thatās totally fair. Great job though itās people like you in the community that just push this hardware to its limits itās so exciting.
In my testļ¼4 nodes with crs804 can run raw modelļ¼v4 flashļ¼ in 1m context with 30tokens/s speed in short requests and 15 tokens/s speed in long context.It can use max mode which has 384k output length.
imma try t he recipe when you say its not ready what are issues you are ssing
you can pass build args in the recipe itself: spark-vllm-docker/recipes at main Ā· eugr/spark-vllm-docker Ā· GitHub
Got it working doing this.. 20 t/s
**DeepSeek V4 Flash W4A16-FP8 ā Dual DGX Spark TP=2 ā 1M Context ā VERIFIED WORKING**
Built via [eugr/spark-vllm-docker]( GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks Ā· GitHub ) + 2 patches from [pasta-paul/dsv4-flash-w4a16-fp8]( GitHub - pasta-paul/dsv4-flash-w4a16-fp8: DeepSeek-V4-Flash W4A16-FP8 quantization on 8x H200 ā patches, recipe, mission report Ā· GitHub ). Endpoint stable at 1M context with 85% memory utilization.
**Hardware:** 2Ć DGX Spark GB10 (SM 12.1a, 121 GiB UMA each), QSFP56 200G interconnect at 169.254.x.x.
**Build:**
```bash
cd ~/spark-vllm-docker
./build-and-copy.sh \
--apply-vllm-pr 40991 \
--apply-vllm-pr 41276 \
--rebuild-vllm \
-t vllm-node-dsv4
```
**Apply patches inside the resulting image:**
```bash
docker run --name dsv4-patcher \
-v ~/dsv4-flash-w4a16-fp8/scripts/patch_v4_packed_mapping.py:/tmp/p1.py:ro \
vllm-node-dsv4:latest \
bash -c āDSV4=$(python3 -c āimport vllm.model_executor.models.deepseek_v4 as m; print(m._file_)ā 2>/dev/null | tail -1); python3 /tmp/p1.py ā$DSV4āā
docker commit dsv4-patcher vllm-node-dsv4:latest
docker rm dsv4-patcher
```
(`patch_workspace_prereserve.py` will fail on this vllm build because its target anchor moved ā thatās OK, just use `āenforce-eager` in the serve command instead.)
**Recipe `recipes/deepseek-v4-flash.yaml`:**
```yaml
recipe_version: ā1ā
name: DeepSeek-V4-Flash-W4A16
model: pastapaul/DeepSeek-V4-Flash-W4A16-FP8
container: vllm-node-dsv4
cluster_only: true
mods: []
defaults:
port: 8000
host: 0.0.0.0
tensor_parallel: 2
gpu_memory_utilization: 0.85
max_model_len: 1048576
env:
TORCH_CUDA_ARCH_LIST: ā12.1aā
VLLM_ALLOW_LONG_MAX_MODEL_LEN: ā1ā
command: |
vllm serve pastapaul/DeepSeek-V4-Flash-W4A16-FP8 \
--served-model-name deepseek-v4-flash \\
--trust-remote-code \\
--kv-cache-dtype fp8 \\
--block-size 256 \\
--tokenizer-mode deepseek_v4 \\
--tool-call-parser deepseek_v4 \\
--enable-auto-tool-choice \\
--reasoning-parser deepseek_v4 \\
--max-model-len {max_model_len} \\
--max-num-seqs 4 \\
--max-num-batched-tokens 8192 \\
--gpu-memory-utilization {gpu_memory_utilization} \\
--host {host} \\
--port {port} \\
-tp {tensor_parallel} \\
--enforce-eager \\
--distributed-executor-backend ray
```
**Run:**
```bash
./run-recipe.sh recipes/deepseek-v4-flash.yaml
```
**Notes:**
- `gpu_memory_utilization: 0.85` keeps system memory at 91-92% (`0.92` pushes to 99% Critical on Netdata)
- `āenforce-eager` is required without the workspace prereservation patch (~4Ć decode penalty vs cudagraphs, but stable)
- `max_model_len: 1048576` confirmed working at 1M with `num_gpu_blocks: 26,091` Ć `block_size: 256` = **6.68M token KV pool** = ~6.4Ć concurrent 1M-token requests
- Decode steady-state: ~14-17 t/s with `āenforce-eager`, ~21 t/s with cudagraphs (when patch can apply)
- KV cache stable, 0 preemptions in initial smoke testing
- Model loads ~143 GiB total weights (~50 GB Ć 3 large shards + 1 small), ~73 GiB resident per rank after TP split
**Pre-reqs that must be on the image (the `āapply-vllm-pr` flags handle most):**
- jasl/vllm + PR 40991 + PR 41276
- transformers ā„ 5.8.0 (released)
- compressed-tensors 0.15.1a20260428 (the prerelease ā newer 20260503 build expects a `scale_fmt` field that pastapaulās quant doesnāt carry)
- PyTorch 2.11.0+cu130, FlashInfer 0.6.9, Triton 3.6.0
- TORCH_CUDA_ARCH_LIST=12.1a build flag
I saw on ModelScope that they already have four versions released.
Do you have any idea what % Q4 will lose exactly from the total capacity?
Do you have any idea when Q4 will be available?
I have a system that makes latency extremely low even on 16GB.
Not sure what you mean by ācapacityā, but the smallest model, V4 Flash, which is 160GB, is already mix quantized at mostly 4-bit by Deepseek.
Any other 4-bit quants you find will not be smaller than what Deepseek already provided you. Thereās nothing better you can wait for.
Mimo needs 4. Tp 2 doesnāt work


