Thank you - I finally have DS4 working! Until now I’d been running antirez’s custom llama on a single node, since there was no way to run it across both of the SPARKs I work with. Now that I can, there’s no comparison: 11.5 t/s against 40+! And that’s with llama-benchy’s Project Gutenberg text; on coding (where MTP is very good at predicting tokens) it runs even faster, averaging 55-60 t/s in my pi.dev sessions, with very encouraging results intelligence-wise. Put simply, this feels very Sonnet-y.
The only issue so far is that I don’t know how to control reasoning effort in pi.dev: the settings/thinking level shows “Off (no reasoning)” even though I can clearly see the model thinking its way through.
A quick llama-benchy result:
$ .venv/bin/llama-benchy --base-url http://localhost:8081/v1 --model deepseek-v4-flash --tokenizer /home/coder/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V4-Flash/snapshots/6976c7ff1b30a1b2cb7805021b8ba4684041f136/ --pp 2048 --tg 2048 --depth 2048
[transformers] PyTorch was not found. Models won’t be available and only tokenizers, configuration and file/data utilities can be used.
llama-benchy (0.3.7)
Date: 2026-06-07 14:13:46
Benchmarking model: deepseek-v4-flash at http://localhost:8081/v1
Concurrency levels: [1]
Loading text from cache: /home/coder/.cache/llama-benchy/cc6a0b5782734ee3b9069aa3b64cc62c.txt
Total tokens available in text corpus: 142813
Warming up…
Warmup (User only) complete. Delta: 4 tokens (Server: 26, Local: 22)
Warmup (System+Empty) complete. Delta: 4 tokens (Server: 26, Local: 22)
Running coherence test…
Coherence test PASSED.
Measuring latency using mode: api…
Average latency (api): 5.04 ms
Running test: pp=2048, tg=2048, depth=2048, concurrency=1
Run 1/3 (batch size 1)…
Run 2/3 (batch size 1)…
Run 3/3 (batch size 1)…
Printing results in MD format:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| deepseek-v4-flash | pp2048 @ d2048 | 1955.99 ± 80.21 | | 2102.73 ± 88.23 | 2097.70 ± 88.23 | 2102.73 ± 88.23 |
| deepseek-v4-flash | tg2048 @ d2048 | 38.16 ± 3.48 | 48.67 ± 2.62 | | | |
llama-benchy (0.3.7)
date: 2026-06-07 14:13:46 | latency mode: api
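
For anyone who wants to sanity-check how the columns relate, here is a rough Python sketch using only the figures from the run above. Two things in it are my own assumptions inferred from the numbers, not something taken from llama-benchy's source: that est_ppt is ttfr minus the measured API latency, and that the pp t/s figure counts both the 2048 prompt tokens and the 2048 depth tokens. Under those assumptions the reported values come back to within a fraction of a percent (the tool averages per run, so an exact match isn't expected).

```python
# Figures copied from the llama-benchy run above.
latency_ms = 5.04             # "Average latency (api)"
ttfr_ms = 2102.73             # pp row, ttfr (ms)
reported_est_ppt_ms = 2097.70
reported_pp_tps = 1955.99
reported_tg_tps = 38.16
pp_tokens, depth_tokens, tg_tokens = 2048, 2048, 2048
single_node_tps = 11.5        # single-node figure quoted in the post

# Assumption: estimated prompt-processing time = ttfr minus API latency.
est_ppt_ms = ttfr_ms - latency_ms

# Assumption: prefill throughput is measured over prompt + depth tokens.
pp_tps = (pp_tokens + depth_tokens) / (est_ppt_ms / 1000.0)

# Wall-clock time to generate the 2048 tokens at the average decode rate.
tg_seconds = tg_tokens / reported_tg_tps

# Speedup of the benchmarked decode rate over the single-node figure.
speedup = reported_tg_tps / single_node_tps

print(f"est_ppt: {est_ppt_ms:.2f} ms (reported {reported_est_ppt_ms:.2f})")
print(f"pp t/s:  {pp_tps:.1f} (reported {reported_pp_tps:.2f})")
print(f"tg time: {tg_seconds:.1f} s for {tg_tokens} tokens")
print(f"speedup: {speedup:.2f}x over single node")
```

Note that the speedup here uses the 38.16 t/s average from this particular run rather than the peak figure, so it is a conservative number.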