DeepSeek v4 Flash (Aiden Recipe from Reddit) - 1M token session operational, Cuda 12.1 tailored for DGX Spark GB10

First - credits where credits due to post by @tonyd615 and @11_p who pointed me to a right direction: DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers - #174 by 11_p as more so to original creator of recipe @aidendle94 and reddit post Reddit - Please wait for verification

By doing this post I am eating my recent words that we haven’t seen 1M operational sessions on Spark besides Nemotrons, now we are. Proven, tested, operational. Massive breakthrough.

================== Full write-up =================

DeepSeek V4 Flash at 1M Context on Dual DGX Spark/Atom AI Top — Working Recipe

After a week of trial and error, I finally have a stable DeepSeek V4 Flash deployment running at 1M context across two DGX Spark (Gigabyte Atom AI Top) nodes with b12x MoE kernels. Sharing the recipe and benchmarks so others don’t hit the same dead ends I did.

TL;DR

  • Image: aidendle94/sparkrun-vllm-ds4-gb10:production-ready — pre-built with b12x MoE, CUDA 12.1, vLLM 0.21.1
  • Stack: Docker Compose, TP=2, no Ray, PyTorch distributed backend
  • Result: 30-45 t/s decode, 1M context, zero fabric delays, 89/100 tool-calling

Hardware

  • 2x DGX Spark / Gigabyte Atom AI Top (GB10, SM121, 128GB unified memory each)
  • ConnectX-7 200Gbps RoCE direct cable (QSFP56)
  • Head and worker on same subnet (192.168.0.0/24)

Dead Ends Avoided

  1. lmxxf/vllm-deepseek-v4-dgx-spark: FP4 Marlin backend, not compatible with FP8 weights
  2. Manual PR 40082 build (--apply-vllm-pr 40082 on vLLM main): FlashInfer/cutlass version mismatch → AttributeError: module 'cutlass.cute.nvgpu' has no attribute 'OperandMajorMode'
  3. Standard spark-vllm-docker builds (dsv4-d568-cherry-sched): No b12x MoE support — VLLM_USE_B12X_MOE=1 env var was unrecognized, 2x slower prefill

Docker Compose Configuration

# compose.yaml
services:
  vllm:
    image: aidendle94/sparkrun-vllm-ds4-gb10:production-ready
    network_mode: host
    ipc: host
    shm_size: "64gb"
    ulimits:
      memlock: -1
      stack: 67108864
    gpus: all
    devices:
      - /dev/infiniband:/dev/infiniband
    volumes:
      - ${HF_CACHE:-${HOME}/.cache/huggingface}:/cache/huggingface
      - /etc/passwd:/etc/passwd:ro
      - /etc/group:/etc/group:ro
    environment:
      HF_HOME: /cache/huggingface
      HF_HUB_OFFLINE: "1"
      VLLM_CACHE_ROOT: /cache/huggingface/vllm-cache
      VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1"
      VLLM_USE_B12X_MOE: "1"
      VLLM_SPARSE_INDEXER_MAX_LOGITS_MB: "256"
      VLLM_NCCL_SO_PATH: /opt/env/lib/python3.12/.../libnccl.so.2
      TORCH_CUDA_ARCH_LIST: "12.1a"
      FLASHINFER_CUDA_ARCH_LIST: "12.1a"
      NCCL_NET: IB
      NCCL_IB_DISABLE: "0"
      NCCL_IB_HCA: "rocep1s0f0,roceP2p1s0f0"     # EDIT: Your NICs
      NCCL_SOCKET_IFNAME: "enP7s7,enp1s0f0np0"    # EDIT: Your NICs
      NCCL_IB_GID_INDEX: "3"
      NCCL_CROSS_NIC: "1"
      NCCL_CUMEM_ENABLE: "0"
      NCCL_IGNORE_CPU_AFFINITY: "1"
      NCCL_DEBUG: WARN
      NODE_RANK: "${NODE_RANK:?set 0 on head, 1 on worker}"
      HEADLESS: "${HEADLESS:-}"
      MASTER_ADDR: "${MASTER_ADDR:?head-node IP}"
    command:
      - bash
      - -lc
      - >
        exec /usr/local/bin/dsv4-vllm-entrypoint serve deepseek-ai/DeepSeek-V4-Flash
        --served-model-name deepseek-v4-flash
        --host 0.0.0.0 --port 8000 --trust-remote-code
        --tensor-parallel-size 2 --pipeline-parallel-size 1
        --kv-cache-dtype fp8 --block-size 256
        --max-model-len 1000000 --max-num-seqs 6
        --max-num-batched-tokens 8192 --gpu-memory-utilization 0.82
        --enable-prefix-caching
        --speculative-config '{"method":"mtp","num_speculative_tokens":2}'
        --tokenizer-mode deepseek_v4
        --distributed-executor-backend mp
        --tool-call-parser deepseek_v4 --enable-auto-tool-choice
        --reasoning-parser deepseek_v4
        --reasoning-config '{"reasoning_parser":"deepseek_v4","reasoning_start_str":"","reasoning_end_str":""}'
        --default-chat-template-kwargs.thinking=true
        --default-chat-template-kwargs.reasoning_effort=high
        --enable-flashinfer-autotune
        --nnodes 2 --node-rank ${NODE_RANK}
        --master-addr ${MASTER_ADDR} --master-port 25000
        ${HEADLESS:+--headless}

Per-Node .env Files

Head node (rank 0)

NODE_RANK=0
HEADLESS=
MASTER_ADDR=192.168.0.8       # EDIT: Your head node CX7 IP
HF_CACHE=/home/user/.cache/huggingface

Worker node (rank 1)

NODE_RANK=1
HEADLESS=1
MASTER_ADDR=192.168.0.8       # EDIT: Your head node CX7 IP (same as head)
HF_CACHE=/home/user/.cache/huggingface

Launch Order

Worker first, then Head.

# Terminal 1 — Worker node
cd /home/user/ds4f-aiden-docker
docker compose up -d

# Wait ~10s, then Terminal 2 — Head node
cd /home/user/ds4f-aiden-docker
docker compose up -d

NCCL Fix

Critical: Without shm_size: 64gb and ulimits memlock=-1, you’ll hit:

NCCL error: unhandled system error
Call to ibv_reg_mr_iova2 failed with error Cannot allocate memory

This is a locked memory limit issue. The shm_size and memlock settings fix it.

Performance Results

Single Request (pp1024, tg128)

Context Prefill t/s Decode t/s TTFT
0 1,188 45.7 1s
240K 1,710 39.4 2.4m
384K 1,510 36.4 4.3m
512K 1,374 36.1 6.2m
720K 1,187 35.0 10.1m
980K 986 30.4 16.6m

Key observations:

  • MLA KV cache is ~2% of GQA — zero fabric delays even at 980K
  • No YaRN tax — decode drops only 23% from 0 to 980K (45→30 t/s)
  • For comparison, Qwen 3.5-122B at 256K+: decode collapses from 40→15 t/s

Concurrency (pp2048, tg128)

Config Depth Prefill t/s Decode t/s TTFT
c1 d0 1,942 36.5 1.2s
c2 d0 1,843 54.4 2.2s
c4 d0 1,883 47.8 3.7s
c1 d4K 2,090 38.5 3.1s
c2 d4K 2,028 38.3 5.0s

Sweet spot: c2 at zero context yields 54.4 t/s decode.

Tool-Calling Quality

tool-eval-bench v1.8.0 — 89/100

Category Score
Tool Selection 100%
Parameter Precision 83%
Multi-Step Chains 75%
Restraint & Refusal 100%
Error Recovery 100%
Structured Output 100%
Overall 89/100

IFEval

  • Instruction-level: 88.2%
  • Prompt-level: 84.8%

Acknowledgements

  • Original Docker image by aidendle94 — thank you for the pre-built b12x image
  • Built on the shoulders of: jasl/vllm fork, lukealonso/b12x kernels, eugr/spark-vllm-docker toolchain
  • NVIDIA Developer Forums community for debugging NCCL issues

Did you try with MTP 3?

Tried on a different version, 2 is optimal. Did not change on this. You welcome to test and post results :)

I am only considering the 2nd spark, can’t try :(

Well then take it for what it is worth. I consider it be a massive leap. Spark x 2 was totally worth it before with older models performing admirably over fabric, now DS4F that can actually run 1M context - this is exceptional. IMO.

Yes, this one is indeed sounds promising. Other options I saw for 2x spark were kinda slow, not really worth it.

I was running Qwen 3.5 122b - it was rock solid close to 50 t/s. Only downside - even with YaRN pushing past 384k was dreadfully slow, 500k was a limit - fabric/ROCe would collapse under load after that.

I mean if you only look from cost perspective in the moment - it is NOT WORTH it, especially if you happy with DS4. It’s dirt cheap in the cloud - Sparks will degrade much faster than you get close to breakeven. However, you avoid vendor trap, price gauging, data spying, insecurity, inability to tune model, SFT, control system prompt, avoid low quantization traps. If you are running business and dependent on AI - worth it. if you are developer just looking for cheap inference - not really. Chinese competition is very strong, I doubt that API pricing across planet will rise, datacenter prices to rent boxes like H100 or GH200 actually going down. US firms definitely price-gauge, same as EU (Mistral, looking at you), but unless you are in defense, gov, public-adjacent health or law - you can’t care less.

My own case - I use it mostly as a test-bed to try and test technology, use it for inference and local Hermes agents working with large Karpathy-WIKIs, SFT and LoRA corpuse building (not sure if I will train on GB10-s) and MVP-ing my quantitative market trading solutions. However, for production I will just rent GH200 in Chicago as my systems are collocated in Aurora next the CME. Auto-deploy, 4 hours a day at 2.5 usd per hour => 21 x 4 x 2.5 = laughable 210 usd/month, comparable to daily commissions cost. But this is my use case.

But for 122b you don’t need the second spark. Also from my experience I see no diff between 3.5 122 and 3.6 35 in quality, but 35b is faster

You definitely can run 122b on one, but one two its 60% faster and have more cache. 35b - I would not debate that. To each individual their own tastes and standards. Happy for you to be happy. For me 122b beats even 3.6 27b on every task, not just speed. Qwen is known to benchmaxx new models, do you own diligence and testing.

I ran into two problems with this, 1 because I was ignorant and 1 because I have a different setup.

I’d forgotten about the ibdev2netdev command to output which interfaces were up. That command told me what to use for the interfaces.

The other problem is that I don’t have my nodes network connected via ethernet. I attached to a simple container with the image, pulled the env and saw variables for the ethernet interface. I updated them in my docker-compose.yml file like this.
TP_SOCKET_IFNAME: “wlP9s9”
MN_IF_NAME: “wlP9s9”
GLOO_SOCKET_IFNAME: “wlP9s9”
OMPI_MCA_btl_tcp_if_include: “wlP9s9”

Thank you for putting in the effort to compile the data and get the compose shared.

Thanks for sharing! Im way out over my skis trying to get DS4 running on 2 sparks of a 3 node cluster, burning through Claude credits, so any help like this is just so appreciated. :-)

Sign up for deepseek v4 pro on deepseek.ai and spend 50x less money than on Claude. I have put 10 bucks in and still have 6 left after weeks of use. Now switched to local v4 flash of course. I have it in Hermes and the article above is basically his job as well as researching GitHub, peeking into vllm docker layers, cross-checking on reddit etc. I used flash and pro intermittently. Both are more than capable. And then we built kernels, cherry picked commits, deployed cluters, found nccl issues. I would have bailed if I was doing it myself (not my core job). This is what people don’t get about AI productivity - it makes failure very cheap, thus allowing you to probe and test way more solutions you would never try as it has 10% success but cost of weeks of work. And sufficient amount of they pay off. Totally worth it.

This is sick. Thanks for all the work.

Can we get this in a recipe file for Spark VLLM Docker? @eugr_nv @eugr

Yes, it’s in my list :)

May need to order another DGX Spark, just to try this out :)

Sign up for free to opencode zen and try, they give trial every day for hours at insane speed. Ds4 flash only, no pro. I usually started for free, half day, then switch to paid directly to deepseek.ai endpoint, dirt cheap. In few hour free is back. :)

Cheers

I’m surprised that none of these improvements are being ported back to vllm main via PRs by lukealonso and the others that made this image possible, the performance difference is massive. Anyone know why that is?

Just wanted to pitch in with my tests on this today.

Using this recipe, DeepSeek-V4-Flash running across two DGX Sparks using the aidendle94/sparkrun-vllm-ds4-gb10:production-ready image (thanks for the recipe). Sharing real measured throughput + a few deployment notes for anyone else attempting this.

Setup

  • 2× GB10, 128 GB unified memory each, TP=2.
  • Interconnect: one QSFP56 cable over the ConnectX-7.
  • vLLM 0.21.1, --max-model-len 1000000 (full 1M served fine), MTP speculative decoding on (num_speculative_tokens=2).

Throughput (llama-benchy, pp2048 / tg128, single node client):

┌──────────────────────────┬──────────┐
│ Test │ t/s │
├──────────────────────────┼──────────┤
│ Prefill @ depth 0, c1 │ 1,574 │
├──────────────────────────┼──────────┤
│ Prefill @ depth 8192, c1 │ 1,586 │
├──────────────────────────┼──────────┤
│ Generation @ d0, c1 │ 35.6 │
├──────────────────────────┼──────────┤
│ Generation @ d0, c4 │ 63.9 │
├──────────────────────────┼──────────┤
│ Generation @ d4096, c4 │ 30.8 │
├──────────────────────────┼──────────┤
│ Generation @ d8192, c4 │ 23.5 │
├──────────────────────────┼──────────┤
│ TTFT @ d0, c1 │ 1,276 ms │
└──────────────────────────┴──────────┘

Takeaways:

  • Prefill is excellent and essentially flat across context depth (~1,580 t/s from d0 to d8192) the long-context story holds up well on this hardware.
  • Single-stream generation ~36 t/s. Scales to ~64 t/s at c4/shallow but degrades under concurrency × depth (23.5 t/s at d8192 c4). Expected MoE-over-two-boxes behaviour — generation is gated by the cross-node NCCL hop on every token.
  • reasoning_effort (high vs max) made no difference to raw token rate, as expected.

Quality (sanity check, tool-calling bench, 69 scenarios): scores in the same band as our other production models. Best autonomous planner of anything we’ve run locally; strong structured reasoning/output.

Running it at the official sampling (temp 1.0 / top_p 1.0, thinking on) and found reasoning_effort: high beats max for agentic/tool work — max added latency and slightly regressed structured-output/safety.

Overall I’m super impressed. Early days, but this thing really does feel sonnet adjacent in chat. Big test will come next week as it replaces 122b in my hermes kanban profile roles.

OK I am at shocked how well this works! Speechless!!!

Works so well on my two sparks it’s pretty insane!!!

I am getting 40-45 tok/sec, insane!!