DeepSeek v4 Flash (Aiden Recipe from Reddit) - 1M token session operational, Cuda 12.1 tailored for DGX Spark GB10

0rand · June 12, 2026, 6:54pm

No offense taken - it’s not my recipe, I just wanted to share it as it was buried in a main thread amidst many discussions and experiments. Plus we hit a few small hurdles at the start, fixed them, and shared the fixes above. Could it be better? Absolutely. I would love better context handling. Is it usable for work as it is? Yes.

aidendle94 · June 13, 2026, 7:57am

Hey guys. Been busy at work. Will push branch and new image on Sunday when I am back in town.

My fix was to apply a PR that fixed some metadata accumulation. It should help.

Was going to task Fable with this but as you may have heard…lol

0rand · June 13, 2026, 10:09am

The man of the hour himself! Thank you, Good Sir, enjoy your weekend. Whenever you drop it - it will be greatly appreciated!

shawndo · June 13, 2026, 9:34pm

Running into what seems like a basic config issue, but I can’t see what I’m doing wrong.
I set the NICs in the compose:

  NCCL_IB_HCA: "rocep1s0f0,roceP2p1s0f0"
  NCCL_SOCKET_IFNAME: "enp1s0f0np0,enP2p1s0f0np0"

but I’m getting an error for an interface I don’t use. Is “enP7s7” hardcoded somewhere?

(Worker pid=31) ERROR 06-13 21:25:05 [multiproc_executor.py:870] RuntimeError: [enforce fail at /pytorch/third_party/gloo/gloo/transport/tcp/device.cc:84] ifa != nullptr. Unable to find address for: enP7s7

Update: I was able to fix this by adding the following to the compose environment section:

GLOO_SOCKET_IFNAME: enp1s0f0np0
TP_SOCKET_IFNAME: enp1s0f0np0

not sure if you need both
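A quick pre-flight check can catch this kind of mismatch before the workers start. The sketch below is only an illustration: `missing_interfaces` and `host_interfaces` are hypothetical helper names, and it assumes each variable holds a plain comma-separated list of exact interface names (NCCL also accepts prefix and exclusion syntax, which this does not handle).

```python
import socket

# Environment variables from this thread that carry network interface names.
IFNAME_VARS = ("NCCL_SOCKET_IFNAME", "GLOO_SOCKET_IFNAME", "TP_SOCKET_IFNAME")


def host_interfaces():
    """Return the set of interface names the OS reports on this host."""
    return {name for _index, name in socket.if_nameindex()}


def missing_interfaces(env, available):
    """Map each variable to the interface names it lists that are not in `available`.

    Assumption: values are comma-separated exact names, as in the compose
    snippet above. Variables that are unset or fully valid are omitted.
    """
    problems = {}
    for var in IFNAME_VARS:
        names = [n.strip() for n in env.get(var, "").split(",") if n.strip()]
        absent = [n for n in names if n not in available]
        if absent:
            problems[var] = absent
    return problems
```

Running `missing_interfaces(os.environ, host_interfaces())` inside the container (after `import os`) should return an empty dict when every listed name exists on that node.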

njzc · June 13, 2026, 11:44pm

I’m using this repo on GitHub: MiaAI-Lab/DeepSeek-V4-Flash-Dual-DGX-Spark-1M-Context (Deploy DeepSeek V4 Flash (MoE reasoning model) on dual DGX Spark nodes with 1M token context, InfiniBand, and FP8 KV-cache)

It works well

jc2375 · June 13, 2026, 11:55pm

Author just commented a couple of posts back in this thread. Looked at the recipe and it’s exactly what the original implementation has. Just like them and the others who say it works…it works for me. Running as main model now, it is excellent to have a model used by most in the cloud as a local agent driver.

Now if we can get Minimax M3 going…only minus for DS4F is the lack of a vision tower :)

0rand · June 14, 2026, 5:45am

It’s identical in essence. She went the extra mile to set up a standalone docker compose file as a GitHub repo for easy cloning instead of mkdir + cat. Kudos to her, less opportunity to mess up.

0rand · June 14, 2026, 5:48am

Unless you use vision non-stop, most harnesses let you route vision separately, so a cloud vision model can be used from time to time. Solves the problem. Or if you have a gaming card in your workstation, a small vision model can run there.

tonyd615 · June 14, 2026, 12:29pm

Qwen 3.5 0.8B MLX is like 1-2 GB, I think. It does great captioning of photos.

0rand · June 14, 2026, 1:15pm

Yeah, you can easily run it alongside the main model on one of the Sparks.

redacted.design · June 14, 2026, 6:07pm

Shout out to all on this topic. Very good effort and outcome. Thanks @MiaAI_Lab for the git, it allowed me the brief Sunday coffee time to enable the model here without too much pre-wake-up thinking. Hermes Agent + DSv4F is pretty good, fits an interaction I was missing. Thanks @aidendle94 and @0rand for the kick-off on all this, I almost dismissed this model due to lack of attention left in my wet brain :D

njzc · June 14, 2026, 6:25pm

The thing is that the recipe you posted here was modified by the Forum somehow.

e.g., VLLM_NCCL_SO_PATH: /opt/env/lib/python3.12/…/libnccl.so.2

0rand · June 14, 2026, 6:36pm

Yeah, but it’s best to check that locally, you may have it installed differently:

find / -name "libnccl.so.2"
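If `find` is not convenient inside the container, the same lookup can be done from Python. This is a minimal sketch: `find_library` is a hypothetical helper, and the root directory you pass is up to you, since the path in the recipe was truncated by the forum.

```python
import os


def find_library(root, filename="libnccl.so.2"):
    """Walk `root` and return the sorted paths of every file named `filename`.

    Unreadable directories are skipped silently, which is os.walk's default.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            matches.append(os.path.join(dirpath, filename))
    return sorted(matches)
```

Whatever path it returns on your install is the value to check against `VLLM_NCCL_SO_PATH`.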

jc2375 · June 15, 2026, 4:13am

Are you running mlx on the arm/cuda environment?

aidendle94 · June 15, 2026, 5:16am

Minor update

aidendle94/sparkrun-vllm-ds4-gb10:production-v2, with vllm-project/vllm Pull Request #44237 by Oxygen56 applied: “[Bugfix] Fix linear host RSS growth under sustained classification load with prefix caching (V1)”

edit:
Will push code tomorrow to github. Sleepy time

kan11 · June 15, 2026, 12:49pm

Thanks for this and have a good night.

I tested this and it seems to work. KV cache clears periodically. I let codex hammer it with 8 concurrent requests with ~455k-token unique prompts to force prefix-cache eviction pressure.

Observed behavior:

KV usage did not monotonically climb.
Under pressure, KV showed repeated eviction drops, for example:
- 0.389 → 0.334
- 0.410 → 0.339
- 0.411 → 0.338
After all requests drained, with running=0 and waiting=0, KV stayed flat at idle.
Final idle samples stayed stable at about 0.2212, and a later check stayed stable at about 0.2253.
Prefix cache was active and not poisoned:
- prefix_cache_queries_total ≈ 9.36M
- prefix_cache_hits_total ≈ 4.98M
Server-side errors/aborts stayed at 0

May need to run it for a few days to confirm.
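For anyone repeating this test, the two checks above (hit ratio from the counters, and spotting eviction drops in sampled KV usage) are simple arithmetic. A minimal sketch follows; `hit_rate` and `eviction_drops` are hypothetical helper names, and the 0.01 threshold is an arbitrary assumption.

```python
def hit_rate(hits, queries):
    """Fraction of prefix-cache queries that were hits (0.0 if no queries yet)."""
    return hits / queries if queries else 0.0


def eviction_drops(samples, min_drop=0.01):
    """Return (before, after) pairs where KV usage fell by at least `min_drop`
    between two consecutive samples."""
    return [(a, b) for a, b in zip(samples, samples[1:]) if a - b >= min_drop]


# Counters quoted above: about 4.98M hits out of about 9.36M queries.
print(f"prefix cache hit rate: {hit_rate(4.98e6, 9.36e6):.1%}")

# The three quoted before/after pairs placed end to end. This is an
# illustrative sequence built from those pairs, not a real sampled trace.
print(eviction_drops([0.389, 0.334, 0.410, 0.339, 0.411, 0.338]))
```

With the counters from this post the hit rate works out to roughly 53%, which fits a workload of mostly unique long prompts.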

rle · June 15, 2026, 3:01pm

A big thank you is in order for your hard work. Appreciated!

0rand · June 15, 2026, 8:56pm

Thank you very much, upgraded. Will report results when I get close to limit (if it happens)!

0rand · June 16, 2026, 5:09am

Empirical evidence only, but after an evening of heavy use the cache grew to just 45% and is climbing very slowly. It usually fills much faster. Same pp and tg. The fix is working!

PS works like magic!

 rate: 97.5%
(APIServer pid=1) INFO 06-16 14:11:31 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.64, Accepted throughput: 25.00 tokens/s, Drafted throughput: 30.40 tokens/s, Accepted: 250 tokens, Drafted: 304 tokens, Per-position acceptance rate: 0.974, 0.671, Avg Draft acceptance rate: 82.2%
(APIServer pid=1) INFO 06-16 14:11:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 39.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 97.3%, Prefix cache hit rate: 97.5%
(APIServer pid=1) INFO 06-16 14:11:41 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.49, Accepted throughput: 23.60 tokens/s, Drafted throughput: 31.60 tokens/s, Accepted: 236 tokens, Drafted: 316 tokens, Per-position acceptance rate: 0.886, 0.608, Avg Draft acceptance rate: 74.7%
(APIServer pid=1) INFO 06-16 14:11:51 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 38.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 97.3%, Prefix cache hit rate: 97.5%
(APIServer pid=1) INFO 06-16 14:11:51 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.46, Accepted throughput: 23.00 tokens/s, Drafted throughput: 31.60 tokens/s, Accepted: 230 tokens, Drafted: 316 tokens, Per-position acceptance rate: 0.892, 0.563, Avg Draft acceptance rate: 72.8%
(APIServer pid=1) INFO:     192.168.1.2:64972 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 06-16 14:12:01 [loggers.py:271] Engine 000: Avg prompt throughput: 249.1 tokens/s, Avg generation throughput: 23.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 95.3%, Prefix cache hit rate: 97.5%
(APIServer pid=1) INFO 06-16 14:12:01 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.41, Accepted throughput: 13.50 tokens/s, Drafted throughput: 19.20 tokens/s, Accepted: 135 tokens, Drafted: 192 tokens, Per-position acceptance rate: 0.854, 0.552, Avg Draft acceptance rate: 70.3%
(APIServer pid=1) INFO 06-16 14:12:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 40.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 95.1%, Prefix cache hit rate: 97.5%
(APIServer pid=1) INFO 06-16 14:12:11 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.65, Accepted throughput: 25.40 tokens/s, Drafted throughput: 30.80 tokens/s, Accepted: 254 tokens, Drafted: 308 tokens, Per-position acceptance rate: 0.974, 0.675, Avg Draft acceptance rate: 82.5%
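The SpecDecoding lines are internally consistent, which makes them easy to sanity-check. The sketch below parses one such line and recomputes two figures. The relationships it uses (average rate = accepted / drafted, mean length = 1 + accepted / draft steps, with one draft step per per-position entry) are inferred from the numbers in this log, not taken from vLLM's source, so treat them as an assumption. `parse_spec_line` and `derived` are hypothetical names.

```python
import re

# Matches the "SpecDecoding metrics" lines as they appear in the log above.
SPEC_RE = re.compile(
    r"Mean acceptance length: (?P<mean_len>[\d.]+), .*?"
    r"Accepted: (?P<accepted>\d+) tokens, Drafted: (?P<drafted>\d+) tokens, "
    r"Per-position acceptance rate: (?P<per_pos>[\d., ]+?), "
    r"Avg Draft acceptance rate: (?P<avg>[\d.]+)%"
)


def parse_spec_line(line):
    """Extract the logged figures from one SpecDecoding line, or None if no match."""
    m = SPEC_RE.search(line)
    if m is None:
        return None
    return {
        "mean_len": float(m.group("mean_len")),
        "accepted": int(m.group("accepted")),
        "drafted": int(m.group("drafted")),
        "per_pos": [float(x) for x in m.group("per_pos").split(",")],
        "avg_pct": float(m.group("avg")),
    }


def derived(stats):
    """Recompute the average rate and mean length from the raw token counts.

    Assumption: the number of per-position entries equals the number of
    draft tokens proposed per step.
    """
    draft_steps = stats["drafted"] / len(stats["per_pos"])
    return {
        "avg_pct": 100.0 * stats["accepted"] / stats["drafted"],
        "mean_len": 1.0 + stats["accepted"] / draft_steps,
    }
```

For the first line in the paste (250 accepted, 304 drafted, two positions) this gives about 82.2% and about 2.64, matching the logged values.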

kneight · June 16, 2026, 7:04am

Love to hear that. For the new image, would any configuration adjustments be needed?
