DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers

tonyd615 · June 2, 2026, 2:33pm

Send your AGENT to this FORUM specifically have them read the whole thing as there has been updates they should know what to do and get you setup

tonyd615 · June 2, 2026, 5:25pm

Seeing some very impressive numbers on some quick benchmarks.

DS4-Flash on 2× DGX Spark — concurrency sweep on Aiden’s b12x image

Sharing real numbers running @aidendle94’s sparkrun-vllm-ds4-gb10:production-ready (the b12x / “unholy-fusion” image) on 2× DGX Spark (GB10), TP=2 over RoCE/ConnectX-7.

Config: vLLM 0.21.1rc1.dev339 · FP8 KV + MXFP4 MoE · MTP=2 · 200K ctx · --enable-flashinfer-autotune · --max-num-seqs 8 · gpu-mem-util 0.8

Concurrency sweep (code prompt, 256 tokens, temp 0, usage-based token count):

C=1: 42 tok/s (single stream)
C=2: 39 /stream · 75 aggregate
C=4: 21 /stream · 84 aggregate
C=6: 24 /stream · 143 aggregate
C=8: 21 /stream · 167 aggregate

Single-stream ~42 tok/s, scaling to ~167 tok/s aggregate at 8-way (8/8 clean, no OOM, no wedge). The --enable-flashinfer-autotune flag @wolttam flagged is doing real work — well above where we were pre-b12x. Per-stream holds ~21-24 tok/s even fully loaded.

KV pool: 16,905 blocks × 256 = ~4.33M tokens. So 500K-context per request is ~8.6× headroom — bumping max-model-len 200K → 500K next.

Huge thanks to @aidendle94 for the image and @wolttam for the 326K + autotune findings. This thread’s been gold. 🙏

aidendle94 · June 2, 2026, 9:09pm

Hah thanks for the shout out. I was gonna share it here but I forgot.

Yes please, if anyone is looking to deploy DSV4 use my image. I haven’t pushed my code yet but if someone wants I can share.

It’s based on a few people’s forks and some DGX specific fixes on top.

For more performance information you can refer to my reddit post:
Deepseek V4 flash performance on DGX Spark : r/LocalLLaMA

corbett_korbett · June 2, 2026, 9:15pm

I have a probably cursed idea. Would running dual sparks with a strix halo that has a 3090 ti egpu even be possible on VLLM? I know llama Cpp has RPC but it’s slow in its current state. If it’s even possible I could push concurrency and context even higher for DS4 flash.

aidendle94 · June 2, 2026, 9:16pm

No. There’s too much latency transporting data around on top of having to rewrite kernels which would be terribly unoptimized with mixed hardware. Llama.cpp takes this route but trades performance for hardware compatibility.

voktolom · June 3, 2026, 6:25am

Yesterday I updated both devices, happy as an elephant))
Thanks to everyone who participated in the DS4F setup!

500K / seqs=8 / gpu=0.85

Context	Context Prefill T/S	Context Decode T/S	Inference Decode `tg128`
4K	1971	39.08	34.45
16K	1992	29.10	34.86
32K	1994	42.97	38.39
128K	1846	40.10	22.02
256K	1632	33.10	33.12
384K	1467	38.03	40.95
480K	1374	41.54	37.03

Recipe

recipe_version: “1”
name: DeepSeek-V4-Flash-B12X-500K-GPU085
description: DeepSeek V4 Flash b12x 500K ctx gpu 0.85 production profile on dual DGX Spark TP=2
model: /models/deepseek-ai-DeepSeek-V4-Flash
container: vllm-node-dsv4-b12x-fix
cluster_only: true

build_args:
mods:

defaults:
port: 8000
host: 0.0.0.0
tensor_parallel: 2
pipeline_parallel: 1
gpu_memory_utilization: 0.85
block_size: 256
max_model_len: 500000
max_num_batched_tokens: 8192
max_num_seqs: 8
served_model_name: deepseek-v4-flash

env:
PATH: /opt/env/bin:/opt/env/nvvm/bin:/opt/env/targets/sbsa-linux/nvvm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
CUDA_HOME: /opt/env/targets/sbsa-linux
CUDA_PATH: /opt/env/targets/sbsa-linux
CUDAToolkit_ROOT: /opt/env/targets/sbsa-linux
LD_LIBRARY_PATH: /opt/env/lib:/opt/env/targets/sbsa-linux/lib
CUDAHOSTCXX: /opt/env/bin/aarch64-conda-linux-gnu-g++
NVCC_PREPEND_FLAGS: -ccbin /opt/env/bin/aarch64-conda-linux-gnu-g++ -I/opt/env/targets/sbsa-linux/include/cccl -I/opt/env/targets/sbsa-linux/include
HF_HOME: /cache/huggingface
TORCH_CUDA_ARCH_LIST: 12.1a
FLASHINFER_CUDA_ARCH_LIST: 12.1a
VLLM_ALLOW_LONG_MAX_MODEL_LEN: “1”
VLLM_USE_B12X_MOE: “1”
VLLM_SPARSE_INDEXER_MAX_LOGITS_MB: “256”
FLASHINFER_DISABLE_VERSION_CHECK: “1”
TILELANG_CLEANUP_TEMP_FILES: “1”
DG_JIT_CACHE_DIR: /cache/huggingface/deepgemm-cache
TORCHINDUCTOR_CACHE_DIR: /cache/huggingface/torchinductor-cache
TRITON_CACHE_DIR: /cache/huggingface/triton-cache
TORCH_EXTENSIONS_DIR: /cache/huggingface/torch_extensions
VLLM_CACHE_ROOT: /cache/huggingface/vllm-cache
DG_JIT_USE_NVRTC: “0”
DG_JIT_PRINT_COMPILER_COMMAND: “1”
NCCL_NET: IB
NCCL_IB_DISABLE: “0”
NCCL_DEBUG: WARN
VLLM_NCCL_SO_PATH: /opt/env/lib/python3.12/site-packages/nvidia/nccl/lib/libnccl.so.2
HF_HUB_OFFLINE: “1”
TRANSFORMERS_OFFLINE: “1”
OMP_NUM_THREADS: “8”

command: |
/opt/env/bin/vllm serve /models/deepseek-ai-DeepSeek-V4-Flash
–served-model-name {served_model_name}
–host {host}
–port {port}
–trust-remote-code
–tensor-parallel-size {tensor_parallel}
–pipeline-parallel-size {pipeline_parallel}
–kv-cache-dtype fp8
–block-size {block_size}
–enable-prefix-caching
–max-model-len {max_model_len}
–max-num-seqs {max_num_seqs}
–enable-chunked-prefill
–max-num-batched-tokens {max_num_batched_tokens}
–gpu-memory-utilization {gpu_memory_utilization}
–distributed-executor-backend mp
–compilation-config ‘{“cudagraph_mode”:“FULL_AND_PIECEWISE”,“custom_ops”:[“all”]}’
–speculative-config ‘{“method”:“mtp”,“num_speculative_tokens”:2}’
–tokenizer-mode deepseek_v4
–tool-call-parser deepseek_v4
–enable-auto-tool-choice
–reasoning-parser deepseek_v4
–reasoning-config ‘{“reasoning_parser”:“deepseek_v4”,“reasoning_start_str”:“”,“reasoning_end_str”:“”}’
–default-chat-template-kwargs ‘{“thinking”:true,“preserve_thinking”:true}’
–load-format safetensors
–enable-flashinfer-autotune

0rand · June 3, 2026, 7:58am

Thanks. Is the speed metric provided for 8 concurrent requests? Or just 1?

corbett_korbett · June 3, 2026, 8:39am

I did it with llama cpp and it worked but like expected its slow. Only 15tk/s on minimax m2.7 Q8. But even Q8 minimax m2.7 with full f16 kv cache is not very smart. Getting so many things wrong and I genuinely cant notice any more quality answers than the q3.5 minimax m2.7 I used to use when I just had the strix+egpu setup. Looks like I need to try DS4 flash on dual sparks if it really is noticeably smarter. At least I have 408gb of memory to run big models slowly with kinda presently surprised the RPC with the 2.5G ethernet bottleneck is usable.

0rand · June 3, 2026, 9:15am

I just found out the PR to enable Blackwell Cuda 12.x was merged into vllm main on May 20, so no gimmicks needed with cherry picking

voktolom · June 3, 2026, 9:37am

This is a test for a single thread.

Keyper-AI · June 3, 2026, 12:51pm

Need this recipe added to @eugr stack!

voktolom · June 3, 2026, 12:56pm

Run completed (~3h 5m), no OOM. Profile: pp=2048, tg=128, depths 4K–384K, concurrency 1 / 4 / 8.

For C>1 in the tables: per-stream = t/s (req), aggregate = t/s (total).**

C=1**

Context	ctx_pp	ctx_tg	tg128
4K	1719	33.90	40.63
16K	2009	35.28	36.27
32K	2028	38.09	28.70
128K	1850	37.20	30.37
256K	1643	32.25	33.23
384K	1456	42.24	46.53

C=4

Context	ctx_pp (agg)	ctx_pp/stream	ctx_tg/stream	ctx_tg (agg)	tg128/stream	tg128 (agg)
4K	2047	769	15.0	46.1	15.3	49.8
16K	2048	969	10.0	16.2	26.0	41.6
32K	2018	1001	8.0	9.2	13.9	39.2
128K	1860	959	6.9	2.4	13.9	35.6
256K	1640	956	12.7	0.8	28.9	26.0
384K	1461	760	9.7	0.6	5.8	0.6

C=8

Context	ctx_pp (agg)	ctx_pp/stream	ctx_tg/stream	ctx_tg (agg)	tg128/stream	tg128 (agg)
4K	2078	540	10.9	47.5	12.0	61.9
16K	2051	644	6.3	15.9	11.7	57.4
32K	2025	662	4.5	8.4	11.2	47.2
128K	1848	613	4.0	2.0	11.6	44.0
256K	1641	554	4.3	0.9	5.3	0.7
384K	1467	497	4.8	0.5	11.8	0.5

logs/llama-benchy-500k-gpu085-depth-c1-c4-c8

|:------------------|----------------------:|---------------:|-----------------:|--------------:|-----------------:|-----------------------:|-----------------------:|-----------------------:|

| deepseek-v4-flash | ctx_pp @ d4096 (c1) | 1719.38 ± 0.00 | 1719.38 ± 0.00 | | | 2384.53 ± 0.00 | 2382.26 ± 0.00 | 2384.53 ± 0.00 |

| deepseek-v4-flash | ctx_tg @ d4096 (c1) | 33.90 ± 0.00 | 33.90 ± 0.00 | 40.00 ± 0.00 | 40.00 ± 0.00 | | | |

| deepseek-v4-flash | pp2048 @ d4096 (c1) | 1522.19 ± 0.00 | 1522.19 ± 0.00 | | | 1347.69 ± 0.00 | 1345.43 ± 0.00 | 1347.69 ± 0.00 |

| deepseek-v4-flash | tg128 @ d4096 (c1) | 40.63 ± 0.00 | 40.63 ± 0.00 | 47.00 ± 0.00 | 47.00 ± 0.00 | | | |

| deepseek-v4-flash | ctx_pp @ d4096 (c4) | 2046.99 ± 0.00 | 768.77 ± 252.23 | | | 5973.25 ± 1959.79 | 5970.98 ± 1959.79 | 5973.25 ± 1959.79 |

| deepseek-v4-flash | ctx_tg @ d4096 (c4) | 46.08 ± 0.00 | 15.00 ± 3.14 | 82.00 ± 0.00 | 23.50 ± 0.87 | | | |

| deepseek-v4-flash | pp2048 @ d4096 (c4) | 1762.95 ± 0.00 | 658.10 ± 217.11 | | | 3494.29 ± 1152.01 | 3492.02 ± 1152.01 | 3494.29 ± 1152.01 |

| deepseek-v4-flash | tg128 @ d4096 (c4) | 49.79 ± 0.00 | 15.27 ± 2.41 | 80.00 ± 0.00 | 24.50 ± 1.50 | | | |

| deepseek-v4-flash | ctx_pp @ d4096 (c8) | 2078.27 ± 0.00 | 540.16 ± 321.18 | | | 10307.40 ± 4786.69 | 10305.13 ± 4786.69 | 10307.40 ± 4786.69 |

| deepseek-v4-flash | ctx_tg @ d4096 (c8) | 47.52 ± 0.00 | 10.92 ± 4.17 | 160.00 ± 0.00 | 21.12 ± 2.09 | | | |

| deepseek-v4-flash | pp2048 @ d4096 (c8) | 1698.35 ± 0.00 | 404.99 ± 245.77 | | | 6648.94 ± 2776.06 | 6646.67 ± 2776.06 | 6648.94 ± 2776.06 |

| deepseek-v4-flash | tg128 @ d4096 (c8) | 61.93 ± 0.00 | 11.97 ± 2.76 | 141.00 ± 0.00 | 21.38 ± 1.93 | | | |

| deepseek-v4-flash | ctx_pp @ d16384 (c1) | 2008.96 ± 0.00 | 2008.96 ± 0.00 | | | 8157.74 ± 0.00 | 8155.47 ± 0.00 | 8159.29 ± 0.00 |

| deepseek-v4-flash | ctx_tg @ d16384 (c1) | 35.28 ± 0.00 | 35.28 ± 0.00 | 41.00 ± 0.00 | 41.00 ± 0.00 | | | |

| deepseek-v4-flash | pp2048 @ d16384 (c1) | 1451.60 ± 0.00 | 1451.60 ± 0.00 | | | 1413.13 ± 0.00 | 1410.86 ± 0.00 | 1413.13 ± 0.00 |

| deepseek-v4-flash | tg128 @ d16384 (c1) | 36.27 ± 0.00 | 36.27 ± 0.00 | 46.00 ± 0.00 | 46.00 ± 0.00 | | | |

| deepseek-v4-flash | ctx_pp @ d16384 (c4) | 2047.51 ± 0.00 | 968.83 ± 583.67 | | | 22091.35 ± 9006.80 | 22089.08 ± 9006.80 | 22092.47 ± 9007.41 |

| deepseek-v4-flash | ctx_tg @ d16384 (c4) | 16.21 ± 0.00 | 9.98 ± 5.06 | 80.00 ± 0.00 | 23.00 ± 2.24 | | | |

| deepseek-v4-flash | pp2048 @ d16384 (c4) | 1612.02 ± 0.00 | 642.77 ± 414.32 | | | 4184.77 ± 1545.59 | 4182.50 ± 1545.59 | 4187.81 ± 1546.19 |

| deepseek-v4-flash | tg128 @ d16384 (c4) | 41.55 ± 0.00 | 25.97 ± 2.39 | 100.00 ± 0.00 | 25.75 ± 14.50 | | | |

| deepseek-v4-flash | ctx_pp @ d16384 (c8) | 2050.95 ± 0.00 | 644.23 ± 558.86 | | | 38678.64 ± 18310.88 | 38676.37 ± 18310.88 | 38679.92 ± 18310.98 |

| deepseek-v4-flash | ctx_tg @ d16384 (c8) | 15.93 ± 0.00 | 6.27 ± 4.94 | 146.00 ± 0.00 | 22.12 ± 3.06 | | | |

| deepseek-v4-flash | pp2048 @ d16384 (c8) | 1698.69 ± 0.00 | 449.66 ± 356.49 | | | 6521.70 ± 2820.13 | 6519.43 ± 2820.13 | 6522.89 ± 2820.50 |

| deepseek-v4-flash | tg128 @ d16384 (c8) | 57.43 ± 0.00 | 11.72 ± 2.82 | 138.00 ± 0.00 | 20.50 ± 1.94 | | | |

| deepseek-v4-flash | ctx_pp @ d32768 (c1) | 2027.73 ± 0.00 | 2027.73 ± 0.00 | | | 16162.25 ± 0.00 | 16159.98 ± 0.00 | 16165.01 ± 0.00 |

| deepseek-v4-flash | ctx_tg @ d32768 (c1) | 38.09 ± 0.00 | 38.09 ± 0.00 | 40.00 ± 0.00 | 40.00 ± 0.00 | | | |

| deepseek-v4-flash | pp2048 @ d32768 (c1) | 1375.85 ± 0.00 | 1375.85 ± 0.00 | | | 1490.80 ± 0.00 | 1488.53 ± 0.00 | 1493.41 ± 0.00 |

| deepseek-v4-flash | tg128 @ d32768 (c1) | 28.70 ± 0.00 | 28.70 ± 0.00 | 34.00 ± 0.00 | 34.00 ± 0.00 | | | |

| deepseek-v4-flash | ctx_pp @ d32768 (c4) | 2018.35 ± 0.00 | 1001.05 ± 580.28 | | | 42619.11 ± 18132.81 | 42616.84 ± 18132.81 | 42619.77 ± 18131.86 |

| deepseek-v4-flash | ctx_tg @ d32768 (c4) | 9.24 ± 0.00 | 7.99 ± 6.64 | 83.00 ± 0.00 | 23.50 ± 3.77 | | | |

| deepseek-v4-flash | pp2048 @ d32768 (c4) | 1405.28 ± 0.00 | 573.71 ± 384.11 | | | 4779.80 ± 1804.00 | 4777.54 ± 1804.00 | 4785.11 ± 1805.32 |

| deepseek-v4-flash | tg128 @ d32768 (c4) | 39.15 ± 0.00 | 13.86 ± 2.39 | 69.00 ± 0.00 | 21.75 ± 0.83 | | | |

| deepseek-v4-flash | ctx_pp @ d32768 (c8) | 2024.63 ± 0.00 | 662.04 ± 557.37 | | | 75525.94 ± 37064.20 | 75523.67 ± 37064.20 | 75527.38 ± 37063.88 |

| deepseek-v4-flash | ctx_tg @ d32768 (c8) | 8.42 ± 0.00 | 4.47 ± 5.18 | 150.00 ± 0.00 | 19.88 ± 2.20 | | | |

| deepseek-v4-flash | pp2048 @ d32768 (c8) | 1589.98 ± 0.00 | 390.14 ± 308.41 | | | 7105.61 ± 2862.48 | 7355.45 ± 2975.88 | 7361.80 ± 2976.21 |

| deepseek-v4-flash | tg128 @ d32768 (c8) | 47.24 ± 0.00 | 11.23 ± 2.29 | 109.00 ± 0.00 | 20.43 ± 1.59 | | | |

| deepseek-v4-flash | ctx_pp @ d131072 (c1) | 1849.50 ± 0.00 | 1849.50 ± 0.00 | | | 70871.19 ± 0.00 | 70868.92 ± 0.00 | 70881.00 ± 0.00 |

| deepseek-v4-flash | ctx_tg @ d131072 (c1) | 37.20 ± 0.00 | 37.20 ± 0.00 | 43.00 ± 0.00 | 43.00 ± 0.00 | | | |

| deepseek-v4-flash | pp2048 @ d131072 (c1) | 1071.65 ± 0.00 | 1071.65 ± 0.00 | | | 1913.34 ± 0.00 | 1911.07 ± 0.00 | 1913.34 ± 0.00 |

| deepseek-v4-flash | tg128 @ d131072 (c1) | 30.37 ± 0.00 | 30.37 ± 0.00 | 36.00 ± 0.00 | 36.00 ± 0.00 | | | |

| deepseek-v4-flash | ctx_pp @ d131072 (c4) | 1859.98 ± 0.00 | 958.72 ± 542.01 | | | 178150.32 ± 78728.59 | 178148.05 ± 78728.59 | 178150.32 ± 78728.59 |

| deepseek-v4-flash | ctx_tg @ d131072 (c4) | 2.35 ± 0.00 | 6.94 ± 10.14 | 78.00 ± 0.00 | 28.25 ± 6.65 | | | |

| deepseek-v4-flash | pp2048 @ d131072 (c4) | 1074.22 ± 0.00 | 388.52 ± 206.72 | | | 6394.57 ± 2106.82 | 6392.30 ± 2106.82 | 6397.27 ± 2108.37 |

| deepseek-v4-flash | tg128 @ d131072 (c4) | 35.64 ± 0.00 | 13.91 ± 2.92 | 66.00 ± 0.00 | 22.50 ± 3.77 | | | |

| deepseek-v4-flash | ctx_pp @ d131072 (c8) | 1848.45 ± 0.00 | 613.41 ± 490.88 | | | 322768.34 ± 161807.41 | 322766.07 ± 161807.41 | 322769.91 ± 161805.71 |

| deepseek-v4-flash | ctx_tg @ d131072 (c8) | 2.04 ± 0.00 | 3.97 ± 8.53 | 77.00 ± 0.00 | 14.50 ± 14.43 | | | |

| deepseek-v4-flash | pp2048 @ d131072 (c8) | 1164.20 ± 0.00 | 217.94 ± 109.15 | | | 10987.48 ± 3420.83 | 10985.21 ± 3420.83 | 10994.37 ± 3422.60 |

| deepseek-v4-flash | tg128 @ d131072 (c8) | 44.03 ± 0.00 | 11.56 ± 2.99 | 134.00 ± 0.00 | 20.14 ± 2.06 | | | |

| deepseek-v4-flash | ctx_pp @ d262144 (c1) | 1643.04 ± 0.00 | 1643.04 ± 0.00 | | | 159550.41 ± 0.00 | 159548.14 ± 0.00 | 159566.06 ± 0.00 |

| deepseek-v4-flash | ctx_tg @ d262144 (c1) | 32.25 ± 0.00 | 32.25 ± 0.00 | 40.00 ± 0.00 | 40.00 ± 0.00 | | | |

| deepseek-v4-flash | pp2048 @ d262144 (c1) | 697.59 ± 0.00 | 697.59 ± 0.00 | | | 2938.09 ± 0.00 | 2935.82 ± 0.00 | 2938.09 ± 0.00 |

| deepseek-v4-flash | tg128 @ d262144 (c1) | 33.23 ± 0.00 | 33.23 ± 0.00 | 40.00 ± 0.00 | 40.00 ± 0.00 | | | |

| deepseek-v4-flash | ctx_pp @ d262144 (c4) | 1639.51 ± 0.00 | 955.63 ± 514.31 | | | 401002.69 ± 178897.57 | 373909.13 ± 199340.66 | 373916.17 ± 199335.53 |

| deepseek-v4-flash | ctx_tg @ d262144 (c4) | 0.79 ± 0.00 | 12.69 ± 17.32 | 45.00 ± 0.00 | 15.67 ± 20.74 | | | |

| deepseek-v4-flash | pp2048 @ d262144 (c4) | 940.85 ± 0.00 | 295.67 ± 84.65 | | | 6245.94 ± 2568.32 | 7433.43 ± 1770.01 | 7442.84 ± 1775.01 |

| deepseek-v4-flash | tg128 @ d262144 (c4) | 25.95 ± 0.00 | 28.86 ± 7.65 | 70.00 ± 0.00 | 23.67 ± 17.31 | | | |

| deepseek-v4-flash | ctx_pp @ d262144 (c8) | 1641.21 ± 0.00 | 554.28 ± 449.07 | | | 721638.65 ± 365630.00 | 721636.38 ± 365630.00 | 721645.26 ± 365630.36 |

| deepseek-v4-flash | ctx_tg @ d262144 (c8) | 0.91 ± 0.00 | 4.26 ± 10.00 | 65.00 ± 0.00 | 9.12 ± 14.34 | | | |

| deepseek-v4-flash | pp2048 @ d262144 (c8) | 12.70 ± 0.00 | 4.28 ± 3.42 | | | 726724.41 ± 368892.46 | 726722.14 ± 368892.46 | 726731.70 ± 368891.80 |

| deepseek-v4-flash | tg128 @ d262144 (c8) | 0.70 ± 0.00 | 5.33 ± 11.85 | 58.00 ± 0.00 | 9.30 ± 14.39 | | | |

| deepseek-v4-flash | ctx_pp @ d393216 (c1) | 1455.79 ± 0.00 | 1455.79 ± 0.00 | | | 270107.14 ± 0.00 | 270104.87 ± 0.00 | 270131.49 ± 0.00 |

| deepseek-v4-flash | ctx_tg @ d393216 (c1) | 42.24 ± 0.00 | 42.24 ± 0.00 | 54.00 ± 0.00 | 54.00 ± 0.00 | | | |

| deepseek-v4-flash | pp2048 @ d393216 (c1) | 611.52 ± 0.00 | 611.52 ± 0.00 | | | 3351.29 ± 0.00 | 3349.02 ± 0.00 | 3374.80 ± 0.00 |

| deepseek-v4-flash | tg128 @ d393216 (c1) | 46.53 ± 0.00 | 46.53 ± 0.00 | 52.00 ± 0.00 | 52.00 ± 0.00 | | | |

| deepseek-v4-flash | ctx_pp @ d393216 (c4) | 1460.98 ± 0.00 | 759.85 ± 426.29 | | | 674488.41 ± 301169.81 | 674486.14 ± 301169.81 | 674497.96 ± 301160.97 |

| deepseek-v4-flash | ctx_tg @ d393216 (c4) | 0.63 ± 0.00 | 9.66 ± 15.98 | 67.00 ± 0.00 | 17.50 ± 17.23 | | | |

| deepseek-v4-flash | pp2048 @ d393216 (c4) | 10.07 ± 0.00 | 4.37 ± 1.85 | | | 544314.12 ± 190550.25 | 544311.85 ± 190550.25 | 544321.89 ± 190550.25 |

| deepseek-v4-flash | tg128 @ d393216 (c4) | 0.56 ± 0.00 | 5.84 ± 9.46 | 45.00 ± 0.00 | 12.25 ± 11.30 | | | |

| deepseek-v4-flash | ctx_pp @ d393216 (c8) | 1466.64 ± 0.00 | 496.79 ± 402.11 | | | 1208984.56 ± 614144.19 | 1208982.29 ± 614144.19 | 1208990.97 ± 614146.33 |

| deepseek-v4-flash | ctx_tg @ d393216 (c8) | 0.54 ± 0.00 | 4.82 ± 11.74 | 56.00 ± 0.00 | 7.88 ± 14.21 | | | |

| deepseek-v4-flash | pp2048 @ d393216 (c8) | 8.61 ± 0.00 | 69.78 ± 177.40 | | | 956715.20 ± 620475.41 | 956712.93 ± 620475.41 | 956724.42 ± 620480.20 |

| deepseek-v4-flash | tg128 @ d393216 (c8) | 0.47 ± 0.00 | 11.82 ± 18.06 | 47.00 ± 0.00 | 12.38 ± 19.70 | | | |

wolttam · June 3, 2026, 1:27pm

There’s two active forks being used in this thread now - are your numbers from Jasl’s fork or the b12x fork?

tonyd615 · June 3, 2026, 1:34pm

this is a model that wants concurrency and context lol. Something different from Minimax 2.7

voktolom · June 3, 2026, 1:35pm

-- b12x

ciprianveg · June 3, 2026, 1:35pm

hello, can you share the vllm image path id, please?

wolttam · June 3, 2026, 2:09pm

Please do share! Would love to publish the means for people to build their own image (e.g. with the eugr stack)

0rand · June 3, 2026, 2:54pm

Yeah I tried to reproduce but failed with b12x

CosmicRaisins · June 3, 2026, 4:18pm

I really appreciate DeepSeek’s research and commitment to open sourcing, but I don’t understand the fascination with their models.

DSV4 Flash seems to get way more attention compared to other models of its class, like minimax m2.7, mimo v2.5, step 3.7 Flash. More quants, more community engagement.

For the work that I do (game development), DSV4 is just not as capable compared to those models. It’s not multimodal. I have to be much more precise with my prompt, or it misinterprets it, and it makes more mistakes when implementing a plan made by a frontier model. It also seems like a lot of work to get it working optimally. I tried a bunch of stuff and fixes and was able to get its decode speed to around 35-40 tok/s, faster than the models listed above, but pp is still a chunk slower than its peers.

Interested to hear others’ experiences.

savu_silviu · June 3, 2026, 4:23pm

I use it under Pi (with full vs code integration) and under Codex Desktop (with a shim for self hosted models) and it is pretty awesome as it keeps on going. the nice part is that it does not start hallucinating like crazy after 130k like most other non frontier models do. (Anthropic models are not frontier AI, they are frontier marketing)

Topic		Replies	Views
Deepseek v4 Flash on 2 Nodes DGX Spark / GB10 Projects deepseek	71	5989	June 15, 2026
DeepSeek v4 Flash (Aiden Recipe from Reddit) - 1M token session operational, Cuda 12.1 tailored for DGX Spark GB10 DGX Spark / GB10 deepseek	134	7567	June 21, 2026
DeepSeek-V4-Flash on 4× DGX Spark via vLLM (jasl fork, TP=4, RDMA, MTP) — 49–54 tok/s single-stream, full recipe + the traps DGX Spark / GB10 Projects deepseek	3	230	June 19, 2026
Fully custom CUDA-native Deepseek 4 Flash optimized for 1x Spark! antirez/ds4 DGX Spark / GB10 Projects gaming , llama , deepseek	73	6574	June 20, 2026
Deepseek V4 released DGX Spark / GB10 deepseek	143	16143	May 18, 2026
DeepSeek V4 Flash (1,048,576 Context) on 2x DGX Spark – Custom Sparkrun Recipe DGX Spark / GB10 jetson , deepseek	11	653	June 14, 2026
DeepSeekV4-Flash hybrid quant, 1x DGX Spark: antirez's optimized 128 GB MLX recipe ported to vLLM for GB10 DGX Spark / GB10 Projects deepseek	18	1863	May 11, 2026
DeepSeek V4 Flash: Bringing Frontier AI to the Home DGX Spark / GB10 deepseek	11	2905	May 17, 2026
Anyone having luck with Deepseek V4 Flash on Dual Sparks? DGX Spark / GB10 deepseek	13	1310	June 4, 2026
We unlocked NVFP4 on the DGX Spark: 20% faster than AWQ! DGX Spark / GB10	144	8717	March 14, 2026

DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers

Run completed (~3h 5m), no OOM. Profile: pp=2048, tg=128, depths 4K–384K, concurrency 1 / 4 / 8.

C=4

C=8

Related topics