DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers

ttsiodras · June 7, 2026, 12:18pm

Thank you - I finally have DS4 working! Until now I’ve been using antirez’s custom single-node llama, since there was no way to run it across both of the Sparks I work with. Now that I can, there’s no comparison: 11.5 t/s against 40+! And that’s with llama-benchy’s Project Gutenberg text; on coding (where MTP is very good at predicting tokens) it runs even faster, averaging 55-60 t/s in my pi.dev sessions, with very encouraging results intelligence-wise. Put simply, this feels very Sonnet-y.

The only issue so far is that I don’t know how to control reasoning effort in pi.dev: the settings/thinking level shows “Off (no reasoning)” even though I can clearly see the model thinking its way through.

A quick llama-benchy result:

$ .venv/bin/llama-benchy --base-url http://localhost:8081/v1 --model deepseek-v4-flash --tokenizer /home/coder/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V4-Flash/snapshots/6976c7ff1b30a1b2cb7805021b8ba4684041f136/ --pp 2048 --tg 2048 --depth 2048
[transformers] PyTorch was not found. Models won’t be available and only tokenizers, configuration and file/data utilities can be used.
llama-benchy (0.3.7)
Date: 2026-06-07 14:13:46
Benchmarking model: deepseek-v4-flash at http://localhost:8081/v1
Concurrency levels: [1]
Loading text from cache: /home/coder/.cache/llama-benchy/cc6a0b5782734ee3b9069aa3b64cc62c.txt
Total tokens available in text corpus: 142813
Warming up…
Warmup (User only) complete. Delta: 4 tokens (Server: 26, Local: 22)
Warmup (System+Empty) complete. Delta: 4 tokens (Server: 26, Local: 22)

Running coherence test…
Coherence test PASSED.
Measuring latency using mode: api…
Average latency (api): 5.04 ms
Running test: pp=2048, tg=2048, depth=2048, concurrency=1
Run 1/3 (batch size 1)…
Run 2/3 (batch size 1)…
Run 3/3 (batch size 1)…
Printing results in MD format:

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| deepseek-v4-flash | pp2048 @ d2048 | 1955.99 ± 80.21 | | 2102.73 ± 88.23 | 2097.70 ± 88.23 | 2102.73 ± 88.23 |
| deepseek-v4-flash | tg2048 @ d2048 | 38.16 ± 3.48 | 48.67 ± 2.62 | | | |

llama-benchy (0.3.7)
date: 2026-06-07 14:13:46 | latency mode: api
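
For a rough sense of what these figures mean in practice, here is a minimal Python sketch that only does arithmetic on the numbers quoted in this post. The function names are made up for illustration. The relation est_ppt ≈ ttfr minus API latency is an inference from the table values, not something taken from llama-benchy's documentation.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the post.
# All constants are copied from the post or the llama-benchy output;
# the helper names below are hypothetical, chosen for this sketch.

SINGLE_NODE_TPS = 11.5     # single-node setup, as quoted in the post
TWO_NODE_TG_TPS = 38.16    # tg2048 @ d2048, mean t/s
TWO_NODE_TG_PEAK = 48.67   # tg2048 @ d2048, peak t/s
TTFR_MS = 2102.73          # pp2048 @ d2048, ttfr
EST_PPT_MS = 2097.70       # pp2048 @ d2048, est_ppt
API_LATENCY_MS = 5.04      # "Average latency (api)"
GEN_TOKENS = 2048          # --tg 2048


def speedup(new_tps: float, old_tps: float) -> float:
    """Ratio of two generation rates."""
    return new_tps / old_tps


def generation_seconds(tokens: int, tps: float) -> float:
    """Wall-clock seconds to generate `tokens` at a steady rate."""
    return tokens / tps


if __name__ == "__main__":
    print(f"mean speedup: {speedup(TWO_NODE_TG_TPS, SINGLE_NODE_TPS):.2f}x")
    print(f"peak speedup: {speedup(TWO_NODE_TG_PEAK, SINGLE_NODE_TPS):.2f}x")
    print(f"2048 tokens, two nodes:  {generation_seconds(GEN_TOKENS, TWO_NODE_TG_TPS):.1f} s")
    print(f"2048 tokens, single node: {generation_seconds(GEN_TOKENS, SINGLE_NODE_TPS):.1f} s")
    # Assumption: est_ppt looks like ttfr with the API latency subtracted.
    print(f"ttfr - latency: {TTFR_MS - API_LATENCY_MS:.2f} ms (table: {EST_PPT_MS:.2f} ms)")
```

On the benchmark text this works out to roughly a 3.3x speedup over the single-node figure; the 55-60 t/s seen in coding sessions would put it closer to 5x.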
