DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers

Send your AGENT to this FORUM specifically have them read the whole thing as there has been updates they should know what to do and get you setup

Seeing some very impressive numbers on some quick benchmarks.

DS4-Flash on 2× DGX Spark — concurrency sweep on Aiden’s b12x image

Sharing real numbers running @aidendle94’s sparkrun-vllm-ds4-gb10:production-ready (the b12x / “unholy-fusion” image) on 2× DGX Spark (GB10), TP=2 over RoCE/ConnectX-7.

Config: vLLM 0.21.1rc1.dev339 · FP8 KV + MXFP4 MoE · MTP=2 · 200K ctx · --enable-flashinfer-autotune · --max-num-seqs 8 · gpu-mem-util 0.8

Concurrency sweep (code prompt, 256 tokens, temp 0, usage-based token count):

C=1: 42 tok/s (single stream)
C=2: 39 /stream · 75 aggregate
C=4: 21 /stream · 84 aggregate
C=6: 24 /stream · 143 aggregate
C=8: 21 /stream · 167 aggregate

Single-stream ~42 tok/s, scaling to ~167 tok/s aggregate at 8-way (8/8 clean, no OOM, no wedge). The --enable-flashinfer-autotune flag @wolttam flagged is doing real work — well above where we were pre-b12x. Per-stream holds ~21-24 tok/s even fully loaded.

KV pool: 16,905 blocks × 256 = ~4.33M tokens. So 500K-context per request is ~8.6× headroom — bumping max-model-len 200K → 500K next.

Huge thanks to @aidendle94 for the image and @wolttam for the 326K + autotune findings. This thread’s been gold. 🙏

Hah thanks for the shout out. I was gonna share it here but I forgot.

Yes please, if anyone is looking to deploy DSV4 use my image. I haven’t pushed my code yet but if someone wants I can share.

It’s based on a few people’s forks and some DGX specific fixes on top.

For more performance information you can refer to my reddit post:
Deepseek V4 flash performance on DGX Spark : r/LocalLLaMA

I have a probably cursed idea. Would running dual sparks with a strix halo that has a 3090 ti egpu even be possible on VLLM? I know llama Cpp has RPC but it’s slow in its current state. If it’s even possible I could push concurrency and context even higher for DS4 flash.

No. There’s too much latency transporting data around on top of having to rewrite kernels which would be terribly unoptimized with mixed hardware. Llama.cpp takes this route but trades performance for hardware compatibility.

Yesterday I updated both devices, happy as an elephant))
Thanks to everyone who participated in the DS4F setup!

500K / seqs=8 / gpu=0.85

Context Context Prefill T/S Context Decode T/S Inference Decode tg128
4K 1971 39.08 34.45
16K 1992 29.10 34.86
32K 1994 42.97 38.39
128K 1846 40.10 22.02
256K 1632 33.10 33.12
384K 1467 38.03 40.95
480K 1374 41.54 37.03
Recipe

recipe_version: “1”
name: DeepSeek-V4-Flash-B12X-500K-GPU085
description: DeepSeek V4 Flash b12x 500K ctx gpu 0.85 production profile on dual DGX Spark TP=2
model: /models/deepseek-ai-DeepSeek-V4-Flash
container: vllm-node-dsv4-b12x-fix
cluster_only: true

build_args:
mods:

defaults:
port: 8000
host: 0.0.0.0
tensor_parallel: 2
pipeline_parallel: 1
gpu_memory_utilization: 0.85
block_size: 256
max_model_len: 500000
max_num_batched_tokens: 8192
max_num_seqs: 8
served_model_name: deepseek-v4-flash

env:
PATH: /opt/env/bin:/opt/env/nvvm/bin:/opt/env/targets/sbsa-linux/nvvm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
CUDA_HOME: /opt/env/targets/sbsa-linux
CUDA_PATH: /opt/env/targets/sbsa-linux
CUDAToolkit_ROOT: /opt/env/targets/sbsa-linux
LD_LIBRARY_PATH: /opt/env/lib:/opt/env/targets/sbsa-linux/lib
CUDAHOSTCXX: /opt/env/bin/aarch64-conda-linux-gnu-g++
NVCC_PREPEND_FLAGS: -ccbin /opt/env/bin/aarch64-conda-linux-gnu-g++ -I/opt/env/targets/sbsa-linux/include/cccl -I/opt/env/targets/sbsa-linux/include
HF_HOME: /cache/huggingface
TORCH_CUDA_ARCH_LIST: 12.1a
FLASHINFER_CUDA_ARCH_LIST: 12.1a
VLLM_ALLOW_LONG_MAX_MODEL_LEN: “1”
VLLM_USE_B12X_MOE: “1”
VLLM_SPARSE_INDEXER_MAX_LOGITS_MB: “256”
FLASHINFER_DISABLE_VERSION_CHECK: “1”
TILELANG_CLEANUP_TEMP_FILES: “1”
DG_JIT_CACHE_DIR: /cache/huggingface/deepgemm-cache
TORCHINDUCTOR_CACHE_DIR: /cache/huggingface/torchinductor-cache
TRITON_CACHE_DIR: /cache/huggingface/triton-cache
TORCH_EXTENSIONS_DIR: /cache/huggingface/torch_extensions
VLLM_CACHE_ROOT: /cache/huggingface/vllm-cache
DG_JIT_USE_NVRTC: “0”
DG_JIT_PRINT_COMPILER_COMMAND: “1”
NCCL_NET: IB
NCCL_IB_DISABLE: “0”
NCCL_DEBUG: WARN
VLLM_NCCL_SO_PATH: /opt/env/lib/python3.12/site-packages/nvidia/nccl/lib/libnccl.so.2
HF_HUB_OFFLINE: “1”
TRANSFORMERS_OFFLINE: “1”
OMP_NUM_THREADS: “8”

command: |
/opt/env/bin/vllm serve /models/deepseek-ai-DeepSeek-V4-Flash
–served-model-name {served_model_name}
–host {host}
–port {port}
–trust-remote-code
–tensor-parallel-size {tensor_parallel}
–pipeline-parallel-size {pipeline_parallel}
–kv-cache-dtype fp8
–block-size {block_size}
–enable-prefix-caching
–max-model-len {max_model_len}
–max-num-seqs {max_num_seqs}
–enable-chunked-prefill
–max-num-batched-tokens {max_num_batched_tokens}
–gpu-memory-utilization {gpu_memory_utilization}
–distributed-executor-backend mp
–compilation-config ‘{“cudagraph_mode”:“FULL_AND_PIECEWISE”,“custom_ops”:[“all”]}’
–speculative-config ‘{“method”:“mtp”,“num_speculative_tokens”:2}’
–tokenizer-mode deepseek_v4
–tool-call-parser deepseek_v4
–enable-auto-tool-choice
–reasoning-parser deepseek_v4
–reasoning-config ‘{“reasoning_parser”:“deepseek_v4”,“reasoning_start_str”:“”,“reasoning_end_str”:“”}’
–default-chat-template-kwargs ‘{“thinking”:true,“preserve_thinking”:true}’
–load-format safetensors
–enable-flashinfer-autotune

Thanks. Is the speed metric provided for 8 concurrent requests? Or just 1?

I did it with llama cpp and it worked but like expected its slow. Only 15tk/s on minimax m2.7 Q8. But even Q8 minimax m2.7 with full f16 kv cache is not very smart. Getting so many things wrong and I genuinely cant notice any more quality answers than the q3.5 minimax m2.7 I used to use when I just had the strix+egpu setup. Looks like I need to try DS4 flash on dual sparks if it really is noticeably smarter. At least I have 408gb of memory to run big models slowly with kinda presently surprised the RPC with the 2.5G ethernet bottleneck is usable.

I just found out the PR to enable Blackwell Cuda 12.x was merged into vllm main on May 20, so no gimmicks needed with cherry picking

This is a test for a single thread.

Need this recipe added to @eugr stack!

Run completed (~3h 5m), no OOM. Profile: pp=2048, tg=128, depths 4K–384K, concurrency 1 / 4 / 8.

For C>1 in the tables: per-stream = t/s (req), aggregate = t/s (total).**

C=1**

Context ctx_pp ctx_tg tg128
4K 1719 33.90 40.63
16K 2009 35.28 36.27
32K 2028 38.09 28.70
128K 1850 37.20 30.37
256K 1643 32.25 33.23
384K 1456 42.24 46.53

C=4

Context ctx_pp (agg) ctx_pp/stream ctx_tg/stream ctx_tg (agg) tg128/stream tg128 (agg)
4K 2047 769 15.0 46.1 15.3 49.8
16K 2048 969 10.0 16.2 26.0 41.6
32K 2018 1001 8.0 9.2 13.9 39.2
128K 1860 959 6.9 2.4 13.9 35.6
256K 1640 956 12.7 0.8 28.9 26.0
384K 1461 760 9.7 0.6 5.8 0.6

C=8

Context ctx_pp (agg) ctx_pp/stream ctx_tg/stream ctx_tg (agg) tg128/stream tg128 (agg)
4K 2078 540 10.9 47.5 12.0 61.9
16K 2051 644 6.3 15.9 11.7 57.4
32K 2025 662 4.5 8.4 11.2 47.2
128K 1848 613 4.0 2.0 11.6 44.0
256K 1641 554 4.3 0.9 5.3 0.7
384K 1467 497 4.8 0.5 11.8 0.5
logs/llama-benchy-500k-gpu085-depth-c1-c4-c8

| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |

|:------------------|----------------------:|---------------:|-----------------:|--------------:|-----------------:|-----------------------:|-----------------------:|-----------------------:|

| deepseek-v4-flash | ctx_pp @ d4096 (c1) | 1719.38 ± 0.00 | 1719.38 ± 0.00 | | | 2384.53 ± 0.00 | 2382.26 ± 0.00 | 2384.53 ± 0.00 |

| deepseek-v4-flash | ctx_tg @ d4096 (c1) | 33.90 ± 0.00 | 33.90 ± 0.00 | 40.00 ± 0.00 | 40.00 ± 0.00 | | | |

| deepseek-v4-flash | pp2048 @ d4096 (c1) | 1522.19 ± 0.00 | 1522.19 ± 0.00 | | | 1347.69 ± 0.00 | 1345.43 ± 0.00 | 1347.69 ± 0.00 |

| deepseek-v4-flash | tg128 @ d4096 (c1) | 40.63 ± 0.00 | 40.63 ± 0.00 | 47.00 ± 0.00 | 47.00 ± 0.00 | | | |

| deepseek-v4-flash | ctx_pp @ d4096 (c4) | 2046.99 ± 0.00 | 768.77 ± 252.23 | | | 5973.25 ± 1959.79 | 5970.98 ± 1959.79 | 5973.25 ± 1959.79 |

| deepseek-v4-flash | ctx_tg @ d4096 (c4) | 46.08 ± 0.00 | 15.00 ± 3.14 | 82.00 ± 0.00 | 23.50 ± 0.87 | | | |

| deepseek-v4-flash | pp2048 @ d4096 (c4) | 1762.95 ± 0.00 | 658.10 ± 217.11 | | | 3494.29 ± 1152.01 | 3492.02 ± 1152.01 | 3494.29 ± 1152.01 |

| deepseek-v4-flash | tg128 @ d4096 (c4) | 49.79 ± 0.00 | 15.27 ± 2.41 | 80.00 ± 0.00 | 24.50 ± 1.50 | | | |

| deepseek-v4-flash | ctx_pp @ d4096 (c8) | 2078.27 ± 0.00 | 540.16 ± 321.18 | | | 10307.40 ± 4786.69 | 10305.13 ± 4786.69 | 10307.40 ± 4786.69 |

| deepseek-v4-flash | ctx_tg @ d4096 (c8) | 47.52 ± 0.00 | 10.92 ± 4.17 | 160.00 ± 0.00 | 21.12 ± 2.09 | | | |

| deepseek-v4-flash | pp2048 @ d4096 (c8) | 1698.35 ± 0.00 | 404.99 ± 245.77 | | | 6648.94 ± 2776.06 | 6646.67 ± 2776.06 | 6648.94 ± 2776.06 |

| deepseek-v4-flash | tg128 @ d4096 (c8) | 61.93 ± 0.00 | 11.97 ± 2.76 | 141.00 ± 0.00 | 21.38 ± 1.93 | | | |

| deepseek-v4-flash | ctx_pp @ d16384 (c1) | 2008.96 ± 0.00 | 2008.96 ± 0.00 | | | 8157.74 ± 0.00 | 8155.47 ± 0.00 | 8159.29 ± 0.00 |

| deepseek-v4-flash | ctx_tg @ d16384 (c1) | 35.28 ± 0.00 | 35.28 ± 0.00 | 41.00 ± 0.00 | 41.00 ± 0.00 | | | |

| deepseek-v4-flash | pp2048 @ d16384 (c1) | 1451.60 ± 0.00 | 1451.60 ± 0.00 | | | 1413.13 ± 0.00 | 1410.86 ± 0.00 | 1413.13 ± 0.00 |

| deepseek-v4-flash | tg128 @ d16384 (c1) | 36.27 ± 0.00 | 36.27 ± 0.00 | 46.00 ± 0.00 | 46.00 ± 0.00 | | | |

| deepseek-v4-flash | ctx_pp @ d16384 (c4) | 2047.51 ± 0.00 | 968.83 ± 583.67 | | | 22091.35 ± 9006.80 | 22089.08 ± 9006.80 | 22092.47 ± 9007.41 |

| deepseek-v4-flash | ctx_tg @ d16384 (c4) | 16.21 ± 0.00 | 9.98 ± 5.06 | 80.00 ± 0.00 | 23.00 ± 2.24 | | | |

| deepseek-v4-flash | pp2048 @ d16384 (c4) | 1612.02 ± 0.00 | 642.77 ± 414.32 | | | 4184.77 ± 1545.59 | 4182.50 ± 1545.59 | 4187.81 ± 1546.19 |

| deepseek-v4-flash | tg128 @ d16384 (c4) | 41.55 ± 0.00 | 25.97 ± 2.39 | 100.00 ± 0.00 | 25.75 ± 14.50 | | | |

| deepseek-v4-flash | ctx_pp @ d16384 (c8) | 2050.95 ± 0.00 | 644.23 ± 558.86 | | | 38678.64 ± 18310.88 | 38676.37 ± 18310.88 | 38679.92 ± 18310.98 |

| deepseek-v4-flash | ctx_tg @ d16384 (c8) | 15.93 ± 0.00 | 6.27 ± 4.94 | 146.00 ± 0.00 | 22.12 ± 3.06 | | | |

| deepseek-v4-flash | pp2048 @ d16384 (c8) | 1698.69 ± 0.00 | 449.66 ± 356.49 | | | 6521.70 ± 2820.13 | 6519.43 ± 2820.13 | 6522.89 ± 2820.50 |

| deepseek-v4-flash | tg128 @ d16384 (c8) | 57.43 ± 0.00 | 11.72 ± 2.82 | 138.00 ± 0.00 | 20.50 ± 1.94 | | | |

| deepseek-v4-flash | ctx_pp @ d32768 (c1) | 2027.73 ± 0.00 | 2027.73 ± 0.00 | | | 16162.25 ± 0.00 | 16159.98 ± 0.00 | 16165.01 ± 0.00 |

| deepseek-v4-flash | ctx_tg @ d32768 (c1) | 38.09 ± 0.00 | 38.09 ± 0.00 | 40.00 ± 0.00 | 40.00 ± 0.00 | | | |

| deepseek-v4-flash | pp2048 @ d32768 (c1) | 1375.85 ± 0.00 | 1375.85 ± 0.00 | | | 1490.80 ± 0.00 | 1488.53 ± 0.00 | 1493.41 ± 0.00 |

| deepseek-v4-flash | tg128 @ d32768 (c1) | 28.70 ± 0.00 | 28.70 ± 0.00 | 34.00 ± 0.00 | 34.00 ± 0.00 | | | |

| deepseek-v4-flash | ctx_pp @ d32768 (c4) | 2018.35 ± 0.00 | 1001.05 ± 580.28 | | | 42619.11 ± 18132.81 | 42616.84 ± 18132.81 | 42619.77 ± 18131.86 |

| deepseek-v4-flash | ctx_tg @ d32768 (c4) | 9.24 ± 0.00 | 7.99 ± 6.64 | 83.00 ± 0.00 | 23.50 ± 3.77 | | | |

| deepseek-v4-flash | pp2048 @ d32768 (c4) | 1405.28 ± 0.00 | 573.71 ± 384.11 | | | 4779.80 ± 1804.00 | 4777.54 ± 1804.00 | 4785.11 ± 1805.32 |

| deepseek-v4-flash | tg128 @ d32768 (c4) | 39.15 ± 0.00 | 13.86 ± 2.39 | 69.00 ± 0.00 | 21.75 ± 0.83 | | | |

| deepseek-v4-flash | ctx_pp @ d32768 (c8) | 2024.63 ± 0.00 | 662.04 ± 557.37 | | | 75525.94 ± 37064.20 | 75523.67 ± 37064.20 | 75527.38 ± 37063.88 |

| deepseek-v4-flash | ctx_tg @ d32768 (c8) | 8.42 ± 0.00 | 4.47 ± 5.18 | 150.00 ± 0.00 | 19.88 ± 2.20 | | | |

| deepseek-v4-flash | pp2048 @ d32768 (c8) | 1589.98 ± 0.00 | 390.14 ± 308.41 | | | 7105.61 ± 2862.48 | 7355.45 ± 2975.88 | 7361.80 ± 2976.21 |

| deepseek-v4-flash | tg128 @ d32768 (c8) | 47.24 ± 0.00 | 11.23 ± 2.29 | 109.00 ± 0.00 | 20.43 ± 1.59 | | | |

| deepseek-v4-flash | ctx_pp @ d131072 (c1) | 1849.50 ± 0.00 | 1849.50 ± 0.00 | | | 70871.19 ± 0.00 | 70868.92 ± 0.00 | 70881.00 ± 0.00 |

| deepseek-v4-flash | ctx_tg @ d131072 (c1) | 37.20 ± 0.00 | 37.20 ± 0.00 | 43.00 ± 0.00 | 43.00 ± 0.00 | | | |

| deepseek-v4-flash | pp2048 @ d131072 (c1) | 1071.65 ± 0.00 | 1071.65 ± 0.00 | | | 1913.34 ± 0.00 | 1911.07 ± 0.00 | 1913.34 ± 0.00 |

| deepseek-v4-flash | tg128 @ d131072 (c1) | 30.37 ± 0.00 | 30.37 ± 0.00 | 36.00 ± 0.00 | 36.00 ± 0.00 | | | |

| deepseek-v4-flash | ctx_pp @ d131072 (c4) | 1859.98 ± 0.00 | 958.72 ± 542.01 | | | 178150.32 ± 78728.59 | 178148.05 ± 78728.59 | 178150.32 ± 78728.59 |

| deepseek-v4-flash | ctx_tg @ d131072 (c4) | 2.35 ± 0.00 | 6.94 ± 10.14 | 78.00 ± 0.00 | 28.25 ± 6.65 | | | |

| deepseek-v4-flash | pp2048 @ d131072 (c4) | 1074.22 ± 0.00 | 388.52 ± 206.72 | | | 6394.57 ± 2106.82 | 6392.30 ± 2106.82 | 6397.27 ± 2108.37 |

| deepseek-v4-flash | tg128 @ d131072 (c4) | 35.64 ± 0.00 | 13.91 ± 2.92 | 66.00 ± 0.00 | 22.50 ± 3.77 | | | |

| deepseek-v4-flash | ctx_pp @ d131072 (c8) | 1848.45 ± 0.00 | 613.41 ± 490.88 | | | 322768.34 ± 161807.41 | 322766.07 ± 161807.41 | 322769.91 ± 161805.71 |

| deepseek-v4-flash | ctx_tg @ d131072 (c8) | 2.04 ± 0.00 | 3.97 ± 8.53 | 77.00 ± 0.00 | 14.50 ± 14.43 | | | |

| deepseek-v4-flash | pp2048 @ d131072 (c8) | 1164.20 ± 0.00 | 217.94 ± 109.15 | | | 10987.48 ± 3420.83 | 10985.21 ± 3420.83 | 10994.37 ± 3422.60 |

| deepseek-v4-flash | tg128 @ d131072 (c8) | 44.03 ± 0.00 | 11.56 ± 2.99 | 134.00 ± 0.00 | 20.14 ± 2.06 | | | |

| deepseek-v4-flash | ctx_pp @ d262144 (c1) | 1643.04 ± 0.00 | 1643.04 ± 0.00 | | | 159550.41 ± 0.00 | 159548.14 ± 0.00 | 159566.06 ± 0.00 |

| deepseek-v4-flash | ctx_tg @ d262144 (c1) | 32.25 ± 0.00 | 32.25 ± 0.00 | 40.00 ± 0.00 | 40.00 ± 0.00 | | | |

| deepseek-v4-flash | pp2048 @ d262144 (c1) | 697.59 ± 0.00 | 697.59 ± 0.00 | | | 2938.09 ± 0.00 | 2935.82 ± 0.00 | 2938.09 ± 0.00 |

| deepseek-v4-flash | tg128 @ d262144 (c1) | 33.23 ± 0.00 | 33.23 ± 0.00 | 40.00 ± 0.00 | 40.00 ± 0.00 | | | |

| deepseek-v4-flash | ctx_pp @ d262144 (c4) | 1639.51 ± 0.00 | 955.63 ± 514.31 | | | 401002.69 ± 178897.57 | 373909.13 ± 199340.66 | 373916.17 ± 199335.53 |

| deepseek-v4-flash | ctx_tg @ d262144 (c4) | 0.79 ± 0.00 | 12.69 ± 17.32 | 45.00 ± 0.00 | 15.67 ± 20.74 | | | |

| deepseek-v4-flash | pp2048 @ d262144 (c4) | 940.85 ± 0.00 | 295.67 ± 84.65 | | | 6245.94 ± 2568.32 | 7433.43 ± 1770.01 | 7442.84 ± 1775.01 |

| deepseek-v4-flash | tg128 @ d262144 (c4) | 25.95 ± 0.00 | 28.86 ± 7.65 | 70.00 ± 0.00 | 23.67 ± 17.31 | | | |

| deepseek-v4-flash | ctx_pp @ d262144 (c8) | 1641.21 ± 0.00 | 554.28 ± 449.07 | | | 721638.65 ± 365630.00 | 721636.38 ± 365630.00 | 721645.26 ± 365630.36 |

| deepseek-v4-flash | ctx_tg @ d262144 (c8) | 0.91 ± 0.00 | 4.26 ± 10.00 | 65.00 ± 0.00 | 9.12 ± 14.34 | | | |

| deepseek-v4-flash | pp2048 @ d262144 (c8) | 12.70 ± 0.00 | 4.28 ± 3.42 | | | 726724.41 ± 368892.46 | 726722.14 ± 368892.46 | 726731.70 ± 368891.80 |

| deepseek-v4-flash | tg128 @ d262144 (c8) | 0.70 ± 0.00 | 5.33 ± 11.85 | 58.00 ± 0.00 | 9.30 ± 14.39 | | | |

| deepseek-v4-flash | ctx_pp @ d393216 (c1) | 1455.79 ± 0.00 | 1455.79 ± 0.00 | | | 270107.14 ± 0.00 | 270104.87 ± 0.00 | 270131.49 ± 0.00 |

| deepseek-v4-flash | ctx_tg @ d393216 (c1) | 42.24 ± 0.00 | 42.24 ± 0.00 | 54.00 ± 0.00 | 54.00 ± 0.00 | | | |

| deepseek-v4-flash | pp2048 @ d393216 (c1) | 611.52 ± 0.00 | 611.52 ± 0.00 | | | 3351.29 ± 0.00 | 3349.02 ± 0.00 | 3374.80 ± 0.00 |

| deepseek-v4-flash | tg128 @ d393216 (c1) | 46.53 ± 0.00 | 46.53 ± 0.00 | 52.00 ± 0.00 | 52.00 ± 0.00 | | | |

| deepseek-v4-flash | ctx_pp @ d393216 (c4) | 1460.98 ± 0.00 | 759.85 ± 426.29 | | | 674488.41 ± 301169.81 | 674486.14 ± 301169.81 | 674497.96 ± 301160.97 |

| deepseek-v4-flash | ctx_tg @ d393216 (c4) | 0.63 ± 0.00 | 9.66 ± 15.98 | 67.00 ± 0.00 | 17.50 ± 17.23 | | | |

| deepseek-v4-flash | pp2048 @ d393216 (c4) | 10.07 ± 0.00 | 4.37 ± 1.85 | | | 544314.12 ± 190550.25 | 544311.85 ± 190550.25 | 544321.89 ± 190550.25 |

| deepseek-v4-flash | tg128 @ d393216 (c4) | 0.56 ± 0.00 | 5.84 ± 9.46 | 45.00 ± 0.00 | 12.25 ± 11.30 | | | |

| deepseek-v4-flash | ctx_pp @ d393216 (c8) | 1466.64 ± 0.00 | 496.79 ± 402.11 | | | 1208984.56 ± 614144.19 | 1208982.29 ± 614144.19 | 1208990.97 ± 614146.33 |

| deepseek-v4-flash | ctx_tg @ d393216 (c8) | 0.54 ± 0.00 | 4.82 ± 11.74 | 56.00 ± 0.00 | 7.88 ± 14.21 | | | |

| deepseek-v4-flash | pp2048 @ d393216 (c8) | 8.61 ± 0.00 | 69.78 ± 177.40 | | | 956715.20 ± 620475.41 | 956712.93 ± 620475.41 | 956724.42 ± 620480.20 |

| deepseek-v4-flash | tg128 @ d393216 (c8) | 0.47 ± 0.00 | 11.82 ± 18.06 | 47.00 ± 0.00 | 12.38 ± 19.70 | | | |

There’s two active forks being used in this thread now - are your numbers from Jasl’s fork or the b12x fork?

this is a model that wants concurrency and context lol. Something different from Minimax 2.7

-- b12x

hello, can you share the vllm image path id, please?

Please do share! Would love to publish the means for people to build their own image (e.g. with the eugr stack)

Yeah I tried to reproduce but failed with b12x

I really appreciate DeepSeek’s research and commitment to open sourcing, but I don’t understand the fascination with their models.

DSV4 Flash seems to get way more attention compared to other models of its class, like minimax m2.7, mimo v2.5, step 3.7 Flash. More quants, more community engagement.

For the work that I do (game development), DSV4 is just not as capable compared to those models. It’s not multimodal. I have to be much more precise with my prompt, or it misinterprets it, and it makes more mistakes when implementing a plan made by a frontier model. It also seems like a lot of work to get it working optimally. I tried a bunch of stuff and fixes and was able to get its decode speed to around 35-40 tok/s, faster than the models listed above, but pp is still a chunk slower than its peers.

Interested to hear others’ experiences.

I use it under Pi (with full vs code integration) and under Codex Desktop (with a shim for self hosted models) and it is pretty awesome as it keeps on going. the nice part is that it does not start hallucinating like crazy after 130k like most other non frontier models do. (Anthropic models are not frontier AI, they are frontier marketing)