DeepSeek v4 Flash (Aiden Recipe from Reddit) - 1M token session operational, Cuda 12.1 tailored for DGX Spark GB10

0rand · June 4, 2026, 8:42am

First - credits where credits due to post by @tonyd615 and @11_p who pointed me to a right direction: DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers - #174 by 11_p as more so to original creator of recipe @aidendle94 and reddit post Reddit - Please wait for verification

By doing this post I am eating my recent words that we haven’t seen 1M operational sessions on Spark besides Nemotrons, now we are. Proven, tested, operational. Massive breakthrough.

================== Full write-up =================

DeepSeek V4 Flash at 1M Context on Dual DGX Spark/Atom AI Top — Working Recipe

After a week of trial and error, I finally have a stable DeepSeek V4 Flash deployment running at 1M context across two DGX Spark (Gigabyte Atom AI Top) nodes with b12x MoE kernels. Sharing the recipe and benchmarks so others don’t hit the same dead ends I did.

TL;DR

Image: aidendle94/sparkrun-vllm-ds4-gb10:production-ready — pre-built with b12x MoE, CUDA 12.1, vLLM 0.21.1
Stack: Docker Compose, TP=2, no Ray, PyTorch distributed backend
Result: 30-45 t/s decode, 1M context, zero fabric delays, 89/100 tool-calling

Hardware

2x DGX Spark / Gigabyte Atom AI Top (GB10, SM121, 128GB unified memory each)
ConnectX-7 200Gbps RoCE direct cable (QSFP56)
Head and worker on same subnet (192.168.0.0/24)

Dead Ends Avoided

lmxxf/vllm-deepseek-v4-dgx-spark: FP4 Marlin backend, not compatible with FP8 weights
Manual PR 40082 build (--apply-vllm-pr 40082 on vLLM main): FlashInfer/cutlass version mismatch → AttributeError: module 'cutlass.cute.nvgpu' has no attribute 'OperandMajorMode'
Standard spark-vllm-docker builds (dsv4-d568-cherry-sched): No b12x MoE support — VLLM_USE_B12X_MOE=1 env var was unrecognized, 2x slower prefill

Docker Compose Configuration

# compose.yaml
services:
  vllm:
    image: aidendle94/sparkrun-vllm-ds4-gb10:production-ready
    network_mode: host
    ipc: host
    shm_size: "64gb"
    ulimits:
      memlock: -1
      stack: 67108864
    gpus: all
    devices:
      - /dev/infiniband:/dev/infiniband
    volumes:
      - ${HF_CACHE:-${HOME}/.cache/huggingface}:/cache/huggingface
      - /etc/passwd:/etc/passwd:ro
      - /etc/group:/etc/group:ro
    environment:
      HF_HOME: /cache/huggingface
      HF_HUB_OFFLINE: "1"
      VLLM_CACHE_ROOT: /cache/huggingface/vllm-cache
      VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1"
      VLLM_USE_B12X_MOE: "1"
      VLLM_SPARSE_INDEXER_MAX_LOGITS_MB: "256"
      VLLM_NCCL_SO_PATH: /opt/env/lib/python3.12/.../libnccl.so.2
      TORCH_CUDA_ARCH_LIST: "12.1a"
      FLASHINFER_CUDA_ARCH_LIST: "12.1a"
      NCCL_NET: IB
      NCCL_IB_DISABLE: "0"
      NCCL_IB_HCA: "rocep1s0f0,roceP2p1s0f0"     # EDIT: Your NICs
      NCCL_SOCKET_IFNAME: "enP7s7,enp1s0f0np0"    # EDIT: Your NICs
      NCCL_IB_GID_INDEX: "3"
      NCCL_CROSS_NIC: "1"
      NCCL_CUMEM_ENABLE: "0"
      NCCL_IGNORE_CPU_AFFINITY: "1"
      NCCL_DEBUG: WARN
      NODE_RANK: "${NODE_RANK:?set 0 on head, 1 on worker}"
      HEADLESS: "${HEADLESS:-}"
      MASTER_ADDR: "${MASTER_ADDR:?head-node IP}"
    command:
      - bash
      - -lc
      - >
        exec /usr/local/bin/dsv4-vllm-entrypoint serve deepseek-ai/DeepSeek-V4-Flash
        --served-model-name deepseek-v4-flash
        --host 0.0.0.0 --port 8000 --trust-remote-code
        --tensor-parallel-size 2 --pipeline-parallel-size 1
        --kv-cache-dtype fp8 --block-size 256
        --max-model-len 1000000 --max-num-seqs 6
        --max-num-batched-tokens 8192 --gpu-memory-utilization 0.82
        --enable-prefix-caching
        --speculative-config '{"method":"mtp","num_speculative_tokens":2}'
        --tokenizer-mode deepseek_v4
        --distributed-executor-backend mp
        --tool-call-parser deepseek_v4 --enable-auto-tool-choice
        --reasoning-parser deepseek_v4
        --reasoning-config '{"reasoning_parser":"deepseek_v4","reasoning_start_str":"","reasoning_end_str":""}'
        --default-chat-template-kwargs.thinking=true
        --default-chat-template-kwargs.reasoning_effort=high
        --enable-flashinfer-autotune
        --nnodes 2 --node-rank ${NODE_RANK}
        --master-addr ${MASTER_ADDR} --master-port 25000
        ${HEADLESS:+--headless}

Per-Node .env Files

Head node (rank 0)

NODE_RANK=0
HEADLESS=
MASTER_ADDR=192.168.0.8       # EDIT: Your head node CX7 IP
HF_CACHE=/home/user/.cache/huggingface

Worker node (rank 1)

NODE_RANK=1
HEADLESS=1
MASTER_ADDR=192.168.0.8       # EDIT: Your head node CX7 IP (same as head)
HF_CACHE=/home/user/.cache/huggingface

Launch Order

Worker first, then Head.

# Terminal 1 — Worker node
cd /home/user/ds4f-aiden-docker
docker compose up -d

# Wait ~10s, then Terminal 2 — Head node
cd /home/user/ds4f-aiden-docker
docker compose up -d

NCCL Fix

Critical: Without shm_size: 64gb and ulimits memlock=-1, you’ll hit:

NCCL error: unhandled system error
Call to ibv_reg_mr_iova2 failed with error Cannot allocate memory

This is a locked memory limit issue. The shm_size and memlock settings fix it.

Performance Results

Single Request (pp1024, tg128)

Context	Prefill t/s	Decode t/s	TTFT
0	1,188	45.7	1s
240K	1,710	39.4	2.4m
384K	1,510	36.4	4.3m
512K	1,374	36.1	6.2m
720K	1,187	35.0	10.1m
980K	986	30.4	16.6m

Key observations:

MLA KV cache is ~2% of GQA — zero fabric delays even at 980K
No YaRN tax — decode drops only 23% from 0 to 980K (45→30 t/s)
For comparison, Qwen 3.5-122B at 256K+: decode collapses from 40→15 t/s

Concurrency (pp2048, tg128)

Config	Depth	Prefill t/s	Decode t/s	TTFT
c1	d0	1,942	36.5	1.2s
c2	d0	1,843	54.4	2.2s
c4	d0	1,883	47.8	3.7s
c1	d4K	2,090	38.5	3.1s
c2	d4K	2,028	38.3	5.0s

Sweet spot: c2 at zero context yields 54.4 t/s decode.

Tool-Calling Quality

tool-eval-bench v1.8.0 — 89/100

Category	Score
Tool Selection	100%
Parameter Precision	83%
Multi-Step Chains	75%
Restraint & Refusal	100%
Error Recovery	100%
Structured Output	100%
Overall	89/100

IFEval

Instruction-level: 88.2%
Prompt-level: 84.8%

Acknowledgements

Original Docker image by aidendle94 — thank you for the pre-built b12x image
Built on the shoulders of: jasl/vllm fork, lukealonso/b12x kernels, eugr/spark-vllm-docker toolchain
NVIDIA Developer Forums community for debugging NCCL issues

truetotosse · June 4, 2026, 9:07am

Did you try with MTP 3?

0rand · June 4, 2026, 9:09am

Tried on a different version, 2 is optimal. Did not change on this. You welcome to test and post results :)

truetotosse · June 4, 2026, 9:10am

I am only considering the 2nd spark, can’t try :(

0rand · June 4, 2026, 9:11am

Well then take it for what it is worth. I consider it be a massive leap. Spark x 2 was totally worth it before with older models performing admirably over fabric, now DS4F that can actually run 1M context - this is exceptional. IMO.

truetotosse · June 4, 2026, 9:13am

Yes, this one is indeed sounds promising. Other options I saw for 2x spark were kinda slow, not really worth it.

0rand · June 4, 2026, 9:14am

I was running Qwen 3.5 122b - it was rock solid close to 50 t/s. Only downside - even with YaRN pushing past 384k was dreadfully slow, 500k was a limit - fabric/ROCe would collapse under load after that.

I mean if you only look from cost perspective in the moment - it is NOT WORTH it, especially if you happy with DS4. It’s dirt cheap in the cloud - Sparks will degrade much faster than you get close to breakeven. However, you avoid vendor trap, price gauging, data spying, insecurity, inability to tune model, SFT, control system prompt, avoid low quantization traps. If you are running business and dependent on AI - worth it. if you are developer just looking for cheap inference - not really. Chinese competition is very strong, I doubt that API pricing across planet will rise, datacenter prices to rent boxes like H100 or GH200 actually going down. US firms definitely price-gauge, same as EU (Mistral, looking at you), but unless you are in defense, gov, public-adjacent health or law - you can’t care less.

My own case - I use it mostly as a test-bed to try and test technology, use it for inference and local Hermes agents working with large Karpathy-WIKIs, SFT and LoRA corpuse building (not sure if I will train on GB10-s) and MVP-ing my quantitative market trading solutions. However, for production I will just rent GH200 in Chicago as my systems are collocated in Aurora next the CME. Auto-deploy, 4 hours a day at 2.5 usd per hour => 21 x 4 x 2.5 = laughable 210 usd/month, comparable to daily commissions cost. But this is my use case.

truetotosse · June 4, 2026, 9:16am

But for 122b you don’t need the second spark. Also from my experience I see no diff between 3.5 122 and 3.6 35 in quality, but 35b is faster

0rand · June 4, 2026, 9:21am

You definitely can run 122b on one, but one two its 60% faster and have more cache. 35b - I would not debate that. To each individual their own tastes and standards. Happy for you to be happy. For me 122b beats even 3.6 27b on every task, not just speed. Qwen is known to benchmaxx new models, do you own diligence and testing.

jduggins · June 5, 2026, 12:25am

I ran into two problems with this, 1 because I was ignorant and 1 because I have a different setup.

I’d forgotten about the ibdev2netdev command to output which interfaces were up. That command told me what to use for the interfaces.

The other problem is that I don’t have my nodes network connected via ethernet. I attached to a simple container with the image, pulled the env and saw variables for the ethernet interface. I updated them in my docker-compose.yml file like this.
TP_SOCKET_IFNAME: “wlP9s9”
MN_IF_NAME: “wlP9s9”
GLOO_SOCKET_IFNAME: “wlP9s9”
OMPI_MCA_btl_tcp_if_include: “wlP9s9”

Thank you for putting in the effort to compile the data and get the compose shared.

cormac.garvey1 · June 5, 2026, 7:54am

Thanks for sharing! Im way out over my skis trying to get DS4 running on 2 sparks of a 3 node cluster, burning through Claude credits, so any help like this is just so appreciated. :-)

0rand · June 5, 2026, 9:25am

Sign up for deepseek v4 pro on deepseek.ai and spend 50x less money than on Claude. I have put 10 bucks in and still have 6 left after weeks of use. Now switched to local v4 flash of course. I have it in Hermes and the article above is basically his job as well as researching GitHub, peeking into vllm docker layers, cross-checking on reddit etc. I used flash and pro intermittently. Both are more than capable. And then we built kernels, cherry picked commits, deployed cluters, found nccl issues. I would have bailed if I was doing it myself (not my core job). This is what people don’t get about AI productivity - it makes failure very cheap, thus allowing you to probe and test way more solutions you would never try as it has 10% success but cost of weeks of work. And sufficient amount of they pay off. Totally worth it.

tocs704 · June 5, 2026, 4:12pm

This is sick. Thanks for all the work.

Can we get this in a recipe file for Spark VLLM Docker? @eugr_nv @eugr

eugr_nv · June 5, 2026, 4:15pm

Yes, it’s in my list :)

spinnakerwind · June 5, 2026, 4:44pm

May need to order another DGX Spark, just to try this out :)

0rand · June 5, 2026, 6:07pm

Sign up for free to opencode zen and try, they give trial every day for hours at insane speed. Ds4 flash only, no pro. I usually started for free, half day, then switch to paid directly to deepseek.ai endpoint, dirt cheap. In few hour free is back. :)

aidendle94 · June 5, 2026, 8:56pm

Cheers

ekkis · June 6, 2026, 4:27am

I’m surprised that none of these improvements are being ported back to vllm main via PRs by lukealonso and the others that made this image possible, the performance difference is massive. Anyone know why that is?

stu.miller · June 6, 2026, 7:08pm

Just wanted to pitch in with my tests on this today.

Using this recipe, DeepSeek-V4-Flash running across two DGX Sparks using the aidendle94/sparkrun-vllm-ds4-gb10:production-ready image (thanks for the recipe). Sharing real measured throughput + a few deployment notes for anyone else attempting this.

Setup

2× GB10, 128 GB unified memory each, TP=2.
Interconnect: one QSFP56 cable over the ConnectX-7.
vLLM 0.21.1, --max-model-len 1000000 (full 1M served fine), MTP speculative decoding on (num_speculative_tokens=2).

Throughput (llama-benchy, pp2048 / tg128, single node client):

┌──────────────────────────┬──────────┐
│ Test │ t/s │
├──────────────────────────┼──────────┤
│ Prefill @ depth 0, c1 │ 1,574 │
├──────────────────────────┼──────────┤
│ Prefill @ depth 8192, c1 │ 1,586 │
├──────────────────────────┼──────────┤
│ Generation @ d0, c1 │ 35.6 │
├──────────────────────────┼──────────┤
│ Generation @ d0, c4 │ 63.9 │
├──────────────────────────┼──────────┤
│ Generation @ d4096, c4 │ 30.8 │
├──────────────────────────┼──────────┤
│ Generation @ d8192, c4 │ 23.5 │
├──────────────────────────┼──────────┤
│ TTFT @ d0, c1 │ 1,276 ms │
└──────────────────────────┴──────────┘

Takeaways:

Prefill is excellent and essentially flat across context depth (~1,580 t/s from d0 to d8192) the long-context story holds up well on this hardware.
Single-stream generation ~36 t/s. Scales to ~64 t/s at c4/shallow but degrades under concurrency × depth (23.5 t/s at d8192 c4). Expected MoE-over-two-boxes behaviour — generation is gated by the cross-node NCCL hop on every token.
reasoning_effort (high vs max) made no difference to raw token rate, as expected.

Quality (sanity check, tool-calling bench, 69 scenarios): scores in the same band as our other production models. Best autonomous planner of anything we’ve run locally; strong structured reasoning/output.

Running it at the official sampling (temp 1.0 / top_p 1.0, thinking on) and found reasoning_effort: high beats max for agentic/tool work — max added latency and slightly regressed structured-output/safety.

Overall I’m super impressed. Early days, but this thing really does feel sonnet adjacent in chat. Big test will come next week as it replaces 122b in my hermes kanban profile roles.

MiaAI_Lab · June 7, 2026, 7:27pm

OK I am at shocked how well this works! Speechless!!!

Works so well on my two sparks it’s pretty insane!!!

I am getting 40-45 tok/sec, insane!!

Topic		Replies	Views
DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers DGX Spark / GB10 deepseek	260	22510	July 15, 2026
Deepseek v4 Flash on 2 Nodes DGX Spark / GB10 Projects deepseek	71	7089	June 15, 2026
DeepSeek-V4-Flash on 4× DGX Spark via vLLM (jasl fork, TP=4, RDMA, MTP) — 49–54 tok/s single-stream, full recipe + the traps DGX Spark / GB10 Projects deepseek	3	769	June 19, 2026
DeepSeekV4-Flash hybrid quant, 1x DGX Spark: antirez's optimized 128 GB MLX recipe ported to vLLM for GB10 DGX Spark / GB10 Projects deepseek	18	2199	May 11, 2026
Instructions for running Deepseek-v4-flash with DSpark using Eugr's repo DGX Spark / GB10 Projects deepseek	10	1222	July 16, 2026
Deepseek V4 released DGX Spark / GB10 deepseek	143	17624	May 18, 2026
Official NVidia optimized DeepSeek-V4-Flash models? DGX Spark / GB10 deepseek	28	2121	July 11, 2026
DeepSeek V4 Flash (1,048,576 Context) on 2x DGX Spark – Custom Sparkrun Recipe DGX Spark / GB10 jetson , deepseek	11	1165	June 14, 2026
DeepSeek-V4-Flash-DSpark on 2× DGX Spark (GB10) — big single-stream speed boost (~60-67 tok/s) + 1M context, now with concurrency DGX Spark / GB10 deepseek	101	7562	July 19, 2026
Fully custom CUDA-native Deepseek 4 Flash optimized for 1x Spark! antirez/ds4 DGX Spark / GB10 Projects gaming , llama , deepseek	79	9292	July 15, 2026