Send your agent to this forum thread and have it read the whole thing. There have been updates, and after reading them it should know what to do to get you set up.
DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers
Seeing some very impressive numbers on some quick benchmarks.
DS4-Flash on 2× DGX Spark — concurrency sweep on Aiden’s b12x image
Sharing real numbers running @aidendle94’s sparkrun-vllm-ds4-gb10:production-ready (the b12x / “unholy-fusion” image) on 2× DGX Spark (GB10), TP=2 over RoCE/ConnectX-7.
Config: vLLM 0.21.1rc1.dev339 · FP8 KV + MXFP4 MoE · MTP=2 · 200K ctx · --enable-flashinfer-autotune · --max-num-seqs 8 · gpu-mem-util 0.8
Concurrency sweep (code prompt, 256 tokens, temp 0, usage-based token count):
C=1: 42 tok/s (single stream)
C=2: 39 /stream · 75 aggregate
C=4: 21 /stream · 84 aggregate
C=6: 24 /stream · 143 aggregate
C=8: 21 /stream · 167 aggregate
Single-stream ~42 tok/s, scaling to ~167 tok/s aggregate at 8-way (8/8 clean, no OOM, no wedge). The --enable-flashinfer-autotune flag @wolttam flagged is doing real work — well above where we were pre-b12x. Per-stream holds ~21-24 tok/s even fully loaded.
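For a quick read on how well the sweep scales, the aggregate figures can be compared against perfect linear scaling from the single-stream result. A minimal sketch using only the numbers posted above (the `sweep` dict and the `scaling_efficiency` helper are names made up for illustration):

```python
# Aggregate tok/s from the concurrency sweep above, keyed by concurrency.
sweep = {1: 42, 2: 75, 4: 84, 6: 143, 8: 167}

def scaling_efficiency(results):
    """Return {concurrency: aggregate / (concurrency * single-stream)}."""
    base = results[1]
    return {c: agg / (c * base) for c, agg in results.items()}

for c, eff in scaling_efficiency(sweep).items():
    speedup = sweep[c] / sweep[1]
    print(f"C={c}: {speedup:.2f}x single-stream, {eff:.0%} of linear")
```

At 8-way this works out to roughly 4x the single-stream throughput, or about half of ideal linear scaling.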
KV pool: 16,905 blocks × 256 = ~4.33M tokens. So 500K-context per request is ~8.6× headroom — bumping max-model-len 200K → 500K next.
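The headroom arithmetic is easy to redo for other context lengths. A small sketch, assuming the block count and block size quoted above and taking "500K" as the 500,000 used for `max_model_len` (the `headroom` helper is just an illustrative name):

```python
# Numbers from the post above.
num_blocks = 16_905   # KV cache blocks in the pool
block_size = 256      # tokens per block

pool_tokens = num_blocks * block_size

def headroom(max_model_len):
    """How many full-length requests fit in the KV pool at once."""
    return pool_tokens / max_model_len

print(f"pool: {pool_tokens:,} tokens")
for ctx in (200_000, 500_000):
    print(f"{ctx:,} ctx: {headroom(ctx):.2f} full-length requests")
```

With `--max-num-seqs 8`, a result above 8 means every slot could in principle hold a full-length request at the same time.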
Huge thanks to @aidendle94 for the image and @wolttam for the 326K + autotune findings. This thread’s been gold. 🙏
Hah thanks for the shout out. I was gonna share it here but I forgot.
Yes please, if anyone is looking to deploy DSV4 use my image. I haven’t pushed my code yet but if someone wants I can share.
It’s based on a few people’s forks and some DGX specific fixes on top.
For more performance information you can refer to my reddit post:
Deepseek V4 flash performance on DGX Spark : r/LocalLLaMA
I have a probably cursed idea. Would running dual Sparks with a Strix Halo that has a 3090 Ti eGPU even be possible on vLLM? I know llama.cpp has RPC, but it's slow in its current state. If it's even possible, I could push concurrency and context even higher for DS4 Flash.
No. There's too much latency in moving data around, on top of having to rewrite kernels, which would be terribly unoptimized on mixed hardware. llama.cpp takes this route but trades performance for hardware compatibility.
Yesterday I updated both devices, happy as an elephant))
Thanks to everyone who participated in the DS4F setup!
500K / seqs=8 / gpu=0.85
| Context | Context Prefill T/S | Context Decode T/S | Inference Decode tg128 |
|---|---|---|---|
| 4K | 1971 | 39.08 | 34.45 |
| 16K | 1992 | 29.10 | 34.86 |
| 32K | 1994 | 42.97 | 38.39 |
| 128K | 1846 | 40.10 | 22.02 |
| 256K | 1632 | 33.10 | 33.12 |
| 384K | 1467 | 38.03 | 40.95 |
| 480K | 1374 | 41.54 | 37.03 |
Recipe
recipe_version: "1"
name: DeepSeek-V4-Flash-B12X-500K-GPU085
description: DeepSeek V4 Flash b12x 500K ctx gpu 0.85 production profile on dual DGX Spark TP=2
model: /models/deepseek-ai-DeepSeek-V4-Flash
container: vllm-node-dsv4-b12x-fix
cluster_only: true
build_args:
mods:
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  pipeline_parallel: 1
  gpu_memory_utilization: 0.85
  block_size: 256
  max_model_len: 500000
  max_num_batched_tokens: 8192
  max_num_seqs: 8
  served_model_name: deepseek-v4-flash
env:
  PATH: /opt/env/bin:/opt/env/nvvm/bin:/opt/env/targets/sbsa-linux/nvvm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
  CUDA_HOME: /opt/env/targets/sbsa-linux
  CUDA_PATH: /opt/env/targets/sbsa-linux
  CUDAToolkit_ROOT: /opt/env/targets/sbsa-linux
  LD_LIBRARY_PATH: /opt/env/lib:/opt/env/targets/sbsa-linux/lib
  CUDAHOSTCXX: /opt/env/bin/aarch64-conda-linux-gnu-g++
  NVCC_PREPEND_FLAGS: -ccbin /opt/env/bin/aarch64-conda-linux-gnu-g++ -I/opt/env/targets/sbsa-linux/include/cccl -I/opt/env/targets/sbsa-linux/include
  HF_HOME: /cache/huggingface
  TORCH_CUDA_ARCH_LIST: 12.1a
  FLASHINFER_CUDA_ARCH_LIST: 12.1a
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1"
  VLLM_USE_B12X_MOE: "1"
  VLLM_SPARSE_INDEXER_MAX_LOGITS_MB: "256"
  FLASHINFER_DISABLE_VERSION_CHECK: "1"
  TILELANG_CLEANUP_TEMP_FILES: "1"
  DG_JIT_CACHE_DIR: /cache/huggingface/deepgemm-cache
  TORCHINDUCTOR_CACHE_DIR: /cache/huggingface/torchinductor-cache
  TRITON_CACHE_DIR: /cache/huggingface/triton-cache
  TORCH_EXTENSIONS_DIR: /cache/huggingface/torch_extensions
  VLLM_CACHE_ROOT: /cache/huggingface/vllm-cache
  DG_JIT_USE_NVRTC: "0"
  DG_JIT_PRINT_COMPILER_COMMAND: "1"
  NCCL_NET: IB
  NCCL_IB_DISABLE: "0"
  NCCL_DEBUG: WARN
  VLLM_NCCL_SO_PATH: /opt/env/lib/python3.12/site-packages/nvidia/nccl/lib/libnccl.so.2
  HF_HUB_OFFLINE: "1"
  TRANSFORMERS_OFFLINE: "1"
  OMP_NUM_THREADS: "8"
command: |
  /opt/env/bin/vllm serve /models/deepseek-ai-DeepSeek-V4-Flash
  --served-model-name {served_model_name}
  --host {host}
  --port {port}
  --trust-remote-code
  --tensor-parallel-size {tensor_parallel}
  --pipeline-parallel-size {pipeline_parallel}
  --kv-cache-dtype fp8
  --block-size {block_size}
  --enable-prefix-caching
  --max-model-len {max_model_len}
  --max-num-seqs {max_num_seqs}
  --enable-chunked-prefill
  --max-num-batched-tokens {max_num_batched_tokens}
  --gpu-memory-utilization {gpu_memory_utilization}
  --distributed-executor-backend mp
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}'
  --tokenizer-mode deepseek_v4
  --tool-call-parser deepseek_v4
  --enable-auto-tool-choice
  --reasoning-parser deepseek_v4
  --reasoning-config '{"reasoning_parser":"deepseek_v4","reasoning_start_str":"","reasoning_end_str":""}'
  --default-chat-template-kwargs '{"thinking":true,"preserve_thinking":true}'
  --load-format safetensors
  --enable-flashinfer-autotune
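The `{name}` placeholders in `command` are presumably filled from `defaults` by whatever runs the recipe. How the runner actually does this is not shown in the thread, so the following is only a sketch of one plausible approach; `render_command` is a hypothetical helper, not part of any tool mentioned here. It substitutes only tokens that look like `{identifier}`, so the JSON arguments (which also contain braces) pass through untouched, something a plain `str.format` call would trip over:

```python
import re

# A subset of the recipe's defaults, copied from above.
defaults = {
    "host": "0.0.0.0",
    "port": 8000,
    "tensor_parallel": 2,
    "max_model_len": 500000,
    "served_model_name": "deepseek-v4-flash",
}

# A shortened command template using the same placeholder style.
template = (
    "vllm serve /models/deepseek-ai-DeepSeek-V4-Flash "
    "--served-model-name {served_model_name} "
    "--host {host} --port {port} "
    "--tensor-parallel-size {tensor_parallel} "
    "--max-model-len {max_model_len} "
    "--speculative-config '{\"method\":\"mtp\",\"num_speculative_tokens\":2}'"
)

def render_command(template, values):
    """Replace {name} tokens that look like identifiers; leave JSON braces alone."""
    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no default for placeholder {{{key}}}")
        return str(values[key])
    return re.sub(r"\{([A-Za-z_][A-Za-z0-9_]*)\}", substitute, template)

print(render_command(template, defaults))
```

Raising on an unknown placeholder is a deliberate choice here: a typo in the recipe fails loudly instead of passing a literal `{...}` to `vllm serve`.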
Thanks. Is the speed metric provided for 8 concurrent requests? Or just 1?
I did it with llama.cpp and it worked, but as expected it's slow: only 15 tok/s on minimax m2.7 Q8. But even Q8 minimax m2.7 with a full f16 KV cache is not very smart. It gets so many things wrong, and I genuinely can't notice any better-quality answers than from the q3.5 minimax m2.7 I used when I just had the Strix + eGPU setup. Looks like I need to try DS4 Flash on dual Sparks if it really is noticeably smarter. At least I have 408 GB of memory to run big models slowly with. Kind of pleasantly surprised that RPC is usable despite the 2.5G Ethernet bottleneck.
I just found out the PR to enable Blackwell CUDA 12.x was merged into vLLM main on May 20, so no cherry-picking gimmicks are needed.
This is a test for a single thread.
Need this recipe added to @eugr stack!
Run completed (~3h 5m), no OOM. Profile: pp=2048, tg=128, depths 4K–384K, concurrency 1 / 4 / 8.
For C>1 in the tables: per-stream = t/s (req), aggregate = t/s (total).
C=1
| Context | ctx_pp | ctx_tg | tg128 |
|---|---|---|---|
| 4K | 1719 | 33.90 | 40.63 |
| 16K | 2009 | 35.28 | 36.27 |
| 32K | 2028 | 38.09 | 28.70 |
| 128K | 1850 | 37.20 | 30.37 |
| 256K | 1643 | 32.25 | 33.23 |
| 384K | 1456 | 42.24 | 46.53 |
C=4
| Context | ctx_pp (agg) | ctx_pp/stream | ctx_tg/stream | ctx_tg (agg) | tg128/stream | tg128 (agg) |
|---|---|---|---|---|---|---|
| 4K | 2047 | 769 | 15.0 | 46.1 | 15.3 | 49.8 |
| 16K | 2048 | 969 | 10.0 | 16.2 | 26.0 | 41.6 |
| 32K | 2018 | 1001 | 8.0 | 9.2 | 13.9 | 39.2 |
| 128K | 1860 | 959 | 6.9 | 2.4 | 13.9 | 35.6 |
| 256K | 1640 | 956 | 12.7 | 0.8 | 28.9 | 26.0 |
| 384K | 1461 | 760 | 9.7 | 0.6 | 5.8 | 0.6 |
C=8
| Context | ctx_pp (agg) | ctx_pp/stream | ctx_tg/stream | ctx_tg (agg) | tg128/stream | tg128 (agg) |
|---|---|---|---|---|---|---|
| 4K | 2078 | 540 | 10.9 | 47.5 | 12.0 | 61.9 |
| 16K | 2051 | 644 | 6.3 | 15.9 | 11.7 | 57.4 |
| 32K | 2025 | 662 | 4.5 | 8.4 | 11.2 | 47.2 |
| 128K | 1848 | 613 | 4.0 | 2.0 | 11.6 | 44.0 |
| 256K | 1641 | 554 | 4.3 | 0.9 | 5.3 | 0.7 |
| 384K | 1467 | 497 | 4.8 | 0.5 | 11.8 | 0.5 |
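In several rows above, the aggregate figure is lower than the per-stream figure (for example 128K at C=4: 6.9 per stream, 2.4 aggregate). That is consistent with the aggregate being total tokens over the wall-clock span of the whole batch, while the per-stream figure averages each request's own rate. Whether the benchmark tool defines the columns exactly this way is an assumption. The sketch below, with made-up timings, only illustrates how the two can diverge when requests start decoding far apart:

```python
from statistics import mean

# Hypothetical (start_s, end_s, tokens) for the decode phase of 4 requests.
# Each request starts decoding late because it waits on a long prefill.
requests = [
    (0.0, 8.0, 128),
    (60.0, 68.0, 128),
    (120.0, 128.0, 128),
    (180.0, 188.0, 128),
]

def per_stream_tps(reqs):
    """Mean of each request's own tokens per second."""
    return mean(tokens / (end - start) for start, end, tokens in reqs)

def aggregate_tps(reqs):
    """All tokens divided by the wall-clock span from first start to last end."""
    first = min(start for start, _, _ in reqs)
    last = max(end for _, end, _ in reqs)
    return sum(tokens for _, _, tokens in reqs) / (last - first)

print(f"per-stream: {per_stream_tps(requests):.1f} tok/s")
print(f"aggregate:  {aggregate_tps(requests):.1f} tok/s")
```

Read this way, a collapsing aggregate at deep contexts says more about requests queuing behind each other's prefill than about raw decode speed.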
logs/llama-benchy-500k-gpu085-depth-c1-c4-c8
| model | test | t/s (total) | t/s (req) | peak t/s | peak t/s (req) | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------------------|----------------------:|---------------:|-----------------:|--------------:|-----------------:|-----------------------:|-----------------------:|-----------------------:|
| deepseek-v4-flash | ctx_pp @ d4096 (c1) | 1719.38 ± 0.00 | 1719.38 ± 0.00 | | | 2384.53 ± 0.00 | 2382.26 ± 0.00 | 2384.53 ± 0.00 |
| deepseek-v4-flash | ctx_tg @ d4096 (c1) | 33.90 ± 0.00 | 33.90 ± 0.00 | 40.00 ± 0.00 | 40.00 ± 0.00 | | | |
| deepseek-v4-flash | pp2048 @ d4096 (c1) | 1522.19 ± 0.00 | 1522.19 ± 0.00 | | | 1347.69 ± 0.00 | 1345.43 ± 0.00 | 1347.69 ± 0.00 |
| deepseek-v4-flash | tg128 @ d4096 (c1) | 40.63 ± 0.00 | 40.63 ± 0.00 | 47.00 ± 0.00 | 47.00 ± 0.00 | | | |
| deepseek-v4-flash | ctx_pp @ d4096 (c4) | 2046.99 ± 0.00 | 768.77 ± 252.23 | | | 5973.25 ± 1959.79 | 5970.98 ± 1959.79 | 5973.25 ± 1959.79 |
| deepseek-v4-flash | ctx_tg @ d4096 (c4) | 46.08 ± 0.00 | 15.00 ± 3.14 | 82.00 ± 0.00 | 23.50 ± 0.87 | | | |
| deepseek-v4-flash | pp2048 @ d4096 (c4) | 1762.95 ± 0.00 | 658.10 ± 217.11 | | | 3494.29 ± 1152.01 | 3492.02 ± 1152.01 | 3494.29 ± 1152.01 |
| deepseek-v4-flash | tg128 @ d4096 (c4) | 49.79 ± 0.00 | 15.27 ± 2.41 | 80.00 ± 0.00 | 24.50 ± 1.50 | | | |
| deepseek-v4-flash | ctx_pp @ d4096 (c8) | 2078.27 ± 0.00 | 540.16 ± 321.18 | | | 10307.40 ± 4786.69 | 10305.13 ± 4786.69 | 10307.40 ± 4786.69 |
| deepseek-v4-flash | ctx_tg @ d4096 (c8) | 47.52 ± 0.00 | 10.92 ± 4.17 | 160.00 ± 0.00 | 21.12 ± 2.09 | | | |
| deepseek-v4-flash | pp2048 @ d4096 (c8) | 1698.35 ± 0.00 | 404.99 ± 245.77 | | | 6648.94 ± 2776.06 | 6646.67 ± 2776.06 | 6648.94 ± 2776.06 |
| deepseek-v4-flash | tg128 @ d4096 (c8) | 61.93 ± 0.00 | 11.97 ± 2.76 | 141.00 ± 0.00 | 21.38 ± 1.93 | | | |
| deepseek-v4-flash | ctx_pp @ d16384 (c1) | 2008.96 ± 0.00 | 2008.96 ± 0.00 | | | 8157.74 ± 0.00 | 8155.47 ± 0.00 | 8159.29 ± 0.00 |
| deepseek-v4-flash | ctx_tg @ d16384 (c1) | 35.28 ± 0.00 | 35.28 ± 0.00 | 41.00 ± 0.00 | 41.00 ± 0.00 | | | |
| deepseek-v4-flash | pp2048 @ d16384 (c1) | 1451.60 ± 0.00 | 1451.60 ± 0.00 | | | 1413.13 ± 0.00 | 1410.86 ± 0.00 | 1413.13 ± 0.00 |
| deepseek-v4-flash | tg128 @ d16384 (c1) | 36.27 ± 0.00 | 36.27 ± 0.00 | 46.00 ± 0.00 | 46.00 ± 0.00 | | | |
| deepseek-v4-flash | ctx_pp @ d16384 (c4) | 2047.51 ± 0.00 | 968.83 ± 583.67 | | | 22091.35 ± 9006.80 | 22089.08 ± 9006.80 | 22092.47 ± 9007.41 |
| deepseek-v4-flash | ctx_tg @ d16384 (c4) | 16.21 ± 0.00 | 9.98 ± 5.06 | 80.00 ± 0.00 | 23.00 ± 2.24 | | | |
| deepseek-v4-flash | pp2048 @ d16384 (c4) | 1612.02 ± 0.00 | 642.77 ± 414.32 | | | 4184.77 ± 1545.59 | 4182.50 ± 1545.59 | 4187.81 ± 1546.19 |
| deepseek-v4-flash | tg128 @ d16384 (c4) | 41.55 ± 0.00 | 25.97 ± 2.39 | 100.00 ± 0.00 | 25.75 ± 14.50 | | | |
| deepseek-v4-flash | ctx_pp @ d16384 (c8) | 2050.95 ± 0.00 | 644.23 ± 558.86 | | | 38678.64 ± 18310.88 | 38676.37 ± 18310.88 | 38679.92 ± 18310.98 |
| deepseek-v4-flash | ctx_tg @ d16384 (c8) | 15.93 ± 0.00 | 6.27 ± 4.94 | 146.00 ± 0.00 | 22.12 ± 3.06 | | | |
| deepseek-v4-flash | pp2048 @ d16384 (c8) | 1698.69 ± 0.00 | 449.66 ± 356.49 | | | 6521.70 ± 2820.13 | 6519.43 ± 2820.13 | 6522.89 ± 2820.50 |
| deepseek-v4-flash | tg128 @ d16384 (c8) | 57.43 ± 0.00 | 11.72 ± 2.82 | 138.00 ± 0.00 | 20.50 ± 1.94 | | | |
| deepseek-v4-flash | ctx_pp @ d32768 (c1) | 2027.73 ± 0.00 | 2027.73 ± 0.00 | | | 16162.25 ± 0.00 | 16159.98 ± 0.00 | 16165.01 ± 0.00 |
| deepseek-v4-flash | ctx_tg @ d32768 (c1) | 38.09 ± 0.00 | 38.09 ± 0.00 | 40.00 ± 0.00 | 40.00 ± 0.00 | | | |
| deepseek-v4-flash | pp2048 @ d32768 (c1) | 1375.85 ± 0.00 | 1375.85 ± 0.00 | | | 1490.80 ± 0.00 | 1488.53 ± 0.00 | 1493.41 ± 0.00 |
| deepseek-v4-flash | tg128 @ d32768 (c1) | 28.70 ± 0.00 | 28.70 ± 0.00 | 34.00 ± 0.00 | 34.00 ± 0.00 | | | |
| deepseek-v4-flash | ctx_pp @ d32768 (c4) | 2018.35 ± 0.00 | 1001.05 ± 580.28 | | | 42619.11 ± 18132.81 | 42616.84 ± 18132.81 | 42619.77 ± 18131.86 |
| deepseek-v4-flash | ctx_tg @ d32768 (c4) | 9.24 ± 0.00 | 7.99 ± 6.64 | 83.00 ± 0.00 | 23.50 ± 3.77 | | | |
| deepseek-v4-flash | pp2048 @ d32768 (c4) | 1405.28 ± 0.00 | 573.71 ± 384.11 | | | 4779.80 ± 1804.00 | 4777.54 ± 1804.00 | 4785.11 ± 1805.32 |
| deepseek-v4-flash | tg128 @ d32768 (c4) | 39.15 ± 0.00 | 13.86 ± 2.39 | 69.00 ± 0.00 | 21.75 ± 0.83 | | | |
| deepseek-v4-flash | ctx_pp @ d32768 (c8) | 2024.63 ± 0.00 | 662.04 ± 557.37 | | | 75525.94 ± 37064.20 | 75523.67 ± 37064.20 | 75527.38 ± 37063.88 |
| deepseek-v4-flash | ctx_tg @ d32768 (c8) | 8.42 ± 0.00 | 4.47 ± 5.18 | 150.00 ± 0.00 | 19.88 ± 2.20 | | | |
| deepseek-v4-flash | pp2048 @ d32768 (c8) | 1589.98 ± 0.00 | 390.14 ± 308.41 | | | 7105.61 ± 2862.48 | 7355.45 ± 2975.88 | 7361.80 ± 2976.21 |
| deepseek-v4-flash | tg128 @ d32768 (c8) | 47.24 ± 0.00 | 11.23 ± 2.29 | 109.00 ± 0.00 | 20.43 ± 1.59 | | | |
| deepseek-v4-flash | ctx_pp @ d131072 (c1) | 1849.50 ± 0.00 | 1849.50 ± 0.00 | | | 70871.19 ± 0.00 | 70868.92 ± 0.00 | 70881.00 ± 0.00 |
| deepseek-v4-flash | ctx_tg @ d131072 (c1) | 37.20 ± 0.00 | 37.20 ± 0.00 | 43.00 ± 0.00 | 43.00 ± 0.00 | | | |
| deepseek-v4-flash | pp2048 @ d131072 (c1) | 1071.65 ± 0.00 | 1071.65 ± 0.00 | | | 1913.34 ± 0.00 | 1911.07 ± 0.00 | 1913.34 ± 0.00 |
| deepseek-v4-flash | tg128 @ d131072 (c1) | 30.37 ± 0.00 | 30.37 ± 0.00 | 36.00 ± 0.00 | 36.00 ± 0.00 | | | |
| deepseek-v4-flash | ctx_pp @ d131072 (c4) | 1859.98 ± 0.00 | 958.72 ± 542.01 | | | 178150.32 ± 78728.59 | 178148.05 ± 78728.59 | 178150.32 ± 78728.59 |
| deepseek-v4-flash | ctx_tg @ d131072 (c4) | 2.35 ± 0.00 | 6.94 ± 10.14 | 78.00 ± 0.00 | 28.25 ± 6.65 | | | |
| deepseek-v4-flash | pp2048 @ d131072 (c4) | 1074.22 ± 0.00 | 388.52 ± 206.72 | | | 6394.57 ± 2106.82 | 6392.30 ± 2106.82 | 6397.27 ± 2108.37 |
| deepseek-v4-flash | tg128 @ d131072 (c4) | 35.64 ± 0.00 | 13.91 ± 2.92 | 66.00 ± 0.00 | 22.50 ± 3.77 | | | |
| deepseek-v4-flash | ctx_pp @ d131072 (c8) | 1848.45 ± 0.00 | 613.41 ± 490.88 | | | 322768.34 ± 161807.41 | 322766.07 ± 161807.41 | 322769.91 ± 161805.71 |
| deepseek-v4-flash | ctx_tg @ d131072 (c8) | 2.04 ± 0.00 | 3.97 ± 8.53 | 77.00 ± 0.00 | 14.50 ± 14.43 | | | |
| deepseek-v4-flash | pp2048 @ d131072 (c8) | 1164.20 ± 0.00 | 217.94 ± 109.15 | | | 10987.48 ± 3420.83 | 10985.21 ± 3420.83 | 10994.37 ± 3422.60 |
| deepseek-v4-flash | tg128 @ d131072 (c8) | 44.03 ± 0.00 | 11.56 ± 2.99 | 134.00 ± 0.00 | 20.14 ± 2.06 | | | |
| deepseek-v4-flash | ctx_pp @ d262144 (c1) | 1643.04 ± 0.00 | 1643.04 ± 0.00 | | | 159550.41 ± 0.00 | 159548.14 ± 0.00 | 159566.06 ± 0.00 |
| deepseek-v4-flash | ctx_tg @ d262144 (c1) | 32.25 ± 0.00 | 32.25 ± 0.00 | 40.00 ± 0.00 | 40.00 ± 0.00 | | | |
| deepseek-v4-flash | pp2048 @ d262144 (c1) | 697.59 ± 0.00 | 697.59 ± 0.00 | | | 2938.09 ± 0.00 | 2935.82 ± 0.00 | 2938.09 ± 0.00 |
| deepseek-v4-flash | tg128 @ d262144 (c1) | 33.23 ± 0.00 | 33.23 ± 0.00 | 40.00 ± 0.00 | 40.00 ± 0.00 | | | |
| deepseek-v4-flash | ctx_pp @ d262144 (c4) | 1639.51 ± 0.00 | 955.63 ± 514.31 | | | 401002.69 ± 178897.57 | 373909.13 ± 199340.66 | 373916.17 ± 199335.53 |
| deepseek-v4-flash | ctx_tg @ d262144 (c4) | 0.79 ± 0.00 | 12.69 ± 17.32 | 45.00 ± 0.00 | 15.67 ± 20.74 | | | |
| deepseek-v4-flash | pp2048 @ d262144 (c4) | 940.85 ± 0.00 | 295.67 ± 84.65 | | | 6245.94 ± 2568.32 | 7433.43 ± 1770.01 | 7442.84 ± 1775.01 |
| deepseek-v4-flash | tg128 @ d262144 (c4) | 25.95 ± 0.00 | 28.86 ± 7.65 | 70.00 ± 0.00 | 23.67 ± 17.31 | | | |
| deepseek-v4-flash | ctx_pp @ d262144 (c8) | 1641.21 ± 0.00 | 554.28 ± 449.07 | | | 721638.65 ± 365630.00 | 721636.38 ± 365630.00 | 721645.26 ± 365630.36 |
| deepseek-v4-flash | ctx_tg @ d262144 (c8) | 0.91 ± 0.00 | 4.26 ± 10.00 | 65.00 ± 0.00 | 9.12 ± 14.34 | | | |
| deepseek-v4-flash | pp2048 @ d262144 (c8) | 12.70 ± 0.00 | 4.28 ± 3.42 | | | 726724.41 ± 368892.46 | 726722.14 ± 368892.46 | 726731.70 ± 368891.80 |
| deepseek-v4-flash | tg128 @ d262144 (c8) | 0.70 ± 0.00 | 5.33 ± 11.85 | 58.00 ± 0.00 | 9.30 ± 14.39 | | | |
| deepseek-v4-flash | ctx_pp @ d393216 (c1) | 1455.79 ± 0.00 | 1455.79 ± 0.00 | | | 270107.14 ± 0.00 | 270104.87 ± 0.00 | 270131.49 ± 0.00 |
| deepseek-v4-flash | ctx_tg @ d393216 (c1) | 42.24 ± 0.00 | 42.24 ± 0.00 | 54.00 ± 0.00 | 54.00 ± 0.00 | | | |
| deepseek-v4-flash | pp2048 @ d393216 (c1) | 611.52 ± 0.00 | 611.52 ± 0.00 | | | 3351.29 ± 0.00 | 3349.02 ± 0.00 | 3374.80 ± 0.00 |
| deepseek-v4-flash | tg128 @ d393216 (c1) | 46.53 ± 0.00 | 46.53 ± 0.00 | 52.00 ± 0.00 | 52.00 ± 0.00 | | | |
| deepseek-v4-flash | ctx_pp @ d393216 (c4) | 1460.98 ± 0.00 | 759.85 ± 426.29 | | | 674488.41 ± 301169.81 | 674486.14 ± 301169.81 | 674497.96 ± 301160.97 |
| deepseek-v4-flash | ctx_tg @ d393216 (c4) | 0.63 ± 0.00 | 9.66 ± 15.98 | 67.00 ± 0.00 | 17.50 ± 17.23 | | | |
| deepseek-v4-flash | pp2048 @ d393216 (c4) | 10.07 ± 0.00 | 4.37 ± 1.85 | | | 544314.12 ± 190550.25 | 544311.85 ± 190550.25 | 544321.89 ± 190550.25 |
| deepseek-v4-flash | tg128 @ d393216 (c4) | 0.56 ± 0.00 | 5.84 ± 9.46 | 45.00 ± 0.00 | 12.25 ± 11.30 | | | |
| deepseek-v4-flash | ctx_pp @ d393216 (c8) | 1466.64 ± 0.00 | 496.79 ± 402.11 | | | 1208984.56 ± 614144.19 | 1208982.29 ± 614144.19 | 1208990.97 ± 614146.33 |
| deepseek-v4-flash | ctx_tg @ d393216 (c8) | 0.54 ± 0.00 | 4.82 ± 11.74 | 56.00 ± 0.00 | 7.88 ± 14.21 | | | |
| deepseek-v4-flash | pp2048 @ d393216 (c8) | 8.61 ± 0.00 | 69.78 ± 177.40 | | | 956715.20 ± 620475.41 | 956712.93 ± 620475.41 | 956724.42 ± 620480.20 |
| deepseek-v4-flash | tg128 @ d393216 (c8) | 0.47 ± 0.00 | 11.82 ± 18.06 | 47.00 ± 0.00 | 12.38 ± 19.70 | | | |
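A sanity check for reading the raw log: in the C=1 rows, `est_ppt` matches the depth divided by the `ctx_pp` throughput to within rounding. The sketch below recomputes it for three rows copied from the table. `expected_prefill_ms` is a name introduced here for illustration, and the relationship is inferred from the numbers rather than from the tool's documentation:

```python
# (depth in tokens, ctx_pp t/s, reported est_ppt in ms) for C=1 rows above.
rows = [
    (4096, 1719.38, 2382.26),
    (131072, 1849.50, 70868.92),
    (393216, 1455.79, 270104.87),
]

def expected_prefill_ms(depth_tokens, tokens_per_second):
    """Time to prefill depth_tokens at a steady tokens_per_second rate."""
    return depth_tokens / tokens_per_second * 1000.0

for depth, tps, reported in rows:
    est = expected_prefill_ms(depth, tps)
    print(f"d{depth}: computed {est:,.0f} ms, reported {reported:,.0f} ms")
```

At 384K depth that is about four and a half minutes before the first token, which is worth keeping in mind when judging the long-context rows.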
There are two active forks in use in this thread now. Are your numbers from Jasl’s fork or the b12x fork?
this is a model that wants concurrency and context lol. Something different from Minimax 2.7
-- b12x
hello, can you share the vllm image path id, please?
Please do share! Would love to publish the means for people to build their own image (e.g. with the eugr stack)
Yeah I tried to reproduce but failed with b12x
I really appreciate DeepSeek’s research and commitment to open sourcing, but I don’t understand the fascination with their models.
DSV4 Flash seems to get way more attention compared to other models of its class, like minimax m2.7, mimo v2.5, step 3.7 Flash. More quants, more community engagement.
For the work that I do (game development), DSV4 is just not as capable compared to those models. It’s not multimodal. I have to be much more precise with my prompt, or it misinterprets it, and it makes more mistakes when implementing a plan made by a frontier model. It also seems like a lot of work to get it working optimally. I tried a bunch of stuff and fixes and was able to get its decode speed to around 35-40 tok/s, faster than the models listed above, but pp is still a chunk slower than its peers.
Interested to hear others’ experiences.
I use it under Pi (with full VS Code integration) and under Codex Desktop (with a shim for self-hosted models), and it is pretty awesome as it keeps on going. The nice part is that it does not start hallucinating like crazy after 130K like most other non-frontier models do. (Anthropic models are not frontier AI, they are frontier marketing)