You probably answered your question yourself: if you most likely don’t need multimodal and work purely with a coding/tool-calling/terminal agent and data processing, DeepSeek’s 1M tokens (potential; likely nobody has yet achieved an actual 1M-per-session capability on Sparks) is very attractive. I honestly don’t know how to work with 256k. My session fills past 256k in an hour. Compaction leads to loss of fidelity, and you spend a lot of time explaining things again. In the end the session fills up again. Unless you one-shot vibe code, large context is vital.
Just IMO
DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers
I’m using Pi for my local mode too. I’ll check out dsv4 in codex when I have a chance next time! I’ve been using Opus for a while now. While I usually stay below 300k ctx, I have never had it hallucinate implementation details ever since 4.5. Hallucination hasn’t been a problem for me when it comes to coding in any model above 100B in size. Sometimes, when a task is badly scoped and vaguely described, Claude pulls data from its knowledge base rather than doing web searches, but that’s more of a user error on my part (e.g. “Make a plan to build xx stack on my home server and run xxx model”).
Why not give it a small local model so it can see? GitHub - stevibe/local-llm-video-captioning
Thanks for the input but in what way did I answer my question? In my experience, DSV4 takes more work to set up, runs at a similar speed as its peers, and isn’t as smart. Is it just a snowball effect? More attention → more adoption → more attention?
Most people probably don’t need 1M context with local models either. Most recipes I’ve seen people share for dsv4 flash use a 262k ctx length, not 1M. Mimo v2.5 also supports 1M ctx, and is noticeably more intelligent at the same quant.
I have a large codebase, but I try to keep individual files single-task and keep an updated project structure file in memory. I rarely reach 262k ctx doing implementation and exploration work with a local model, and have never reached 1M with frontier models even with lazy prompting.
I’m not saying DSV4 Flash is incapable in any sense, it’s just interesting to me that so many people are willing to commit a lot of effort to this model to make it work slightly better vs running another model.
New shiny object? I tried it and got it running half-decent. Then I went back to Qwen 3.5 122b, which is faster, handles the same 512k, and scores higher on my bench. DS4F is cool, but it’s by no means insane. If anything, it is extremely cheap in the cloud, so it works very well as a backup for a local model. And cloud inference can definitely handle over 500k (tested), even though it scores lower than my local setup (likely due to heavier quantization).
Last time I tried Mimo 2.5 it was basically unusable past 100k context so I moved on. Minimax M2.7 was very good at times but also struggled a lot once you got to 150k or so context. DSv4 Flash is faster than Mimo and at least as fast if not faster than Minimax, and has the benefit of staying perfectly lucid at least to 300k context and probably beyond. It stays on point and can follow through on detailed plans in a way I rarely saw with Minimax, so despite being less intelligent on paper it feels much more useable imo.
agreed, and it knows how to properly tool call
From my perspective, here is what DS4 Flash on 2x Spark has going for it:
- We can use the original weights to reproduce the results of the API, there is no concern that something went wrong during quantization, NVFP4 vs INT4 etc.
- With 1M context, it occupies 102G per Spark. That leaves enough room for other things to run in parallel like image gen or a TTS/STT stack, it’s a very handy size
- The latest recipes posted in this thread are remarkable in how little pp or tg fall off at large context sizes. I have been using it in pi coding agent at ~200k context for hours now, and it feels quite similar to a fresh context. No other models I have tried showed that little degradation of throughput at such a large context. Concurrency also works nicely
- I believe DeepSeek has announced that the vision-input-capable update of DS4 Flash will be released some weeks after the current weights, so I expect image input to come
- In terms of quality of results, I have been quite happy, but I have not tried too many alternatives so far (recent Spark owner here)
I agree with these statements, but with a caveat: while we have plenty of RAM for multiple 1M sessions (my setup had about 6M of cache), practicality places a huge constraint on it. It all depends on pp speed: if it crawls at 100 t/s pp, it will take hours to get any response, which is as good as having no cache. I observed that a very large context suffers disproportionately from having to push context back and forth through the fabric/QSFP56 link. Large context is better on a single Spark, if it can handle the model. If only we could get Nemotron-like context handling with DeepSeek- or Qwen-type intelligence. But quite likely the two are connected: few attention heads in Nemotrons, very small tokens, easy to shove around.
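To put the "it will take hours" point in numbers, here is a minimal sketch of the arithmetic. It assumes a cold prefill with no prefix-cache hit and a constant prefill rate, which is a simplification; the function name is made up for illustration.

```python
def cold_prefill_seconds(context_tokens: int, prefill_tok_per_s: float) -> float:
    """Seconds needed to process a prompt from scratch at a constant prefill rate."""
    return context_tokens / prefill_tok_per_s


if __name__ == "__main__":
    # 100 tok/s is the "crawl" rate mentioned above; the context sizes are examples.
    for ctx in (256_000, 512_000, 1_000_000):
        hours = cold_prefill_seconds(ctx, 100.0) / 3600
        print(f"{ctx:>9,} tokens at 100 tok/s: {hours:.1f} h")
```

At 100 t/s a full 1M-token context works out to 10,000 seconds, a bit under three hours, before the first output token.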
DeepSeek-V4-Flash (official FP8) on 4× DGX Spark — TP=4, 500K ctx, b12x, ~70 tok/s single-stream + concurrency results
Just got official DeepSeek-V4-Flash running on 4× DGX Spark (GB10) at TP=4 with 500K context using aidendle94’s b12x-optimized vLLM fork. Sharing full numbers since I hadn’t seen a confirmed 4-node benchmark post yet.
Hardware
- 4× DGX Spark (GB10), 128GB unified memory each
- MikroTik CRS812 switch, each node 2×200G RoCE (400G ports broken out)
- RoCE / NCCL over CX-7 NICs
Software
- Image: aidendle94/sparkrun-vllm-ds4-gb10:production-ready (b12x branch)
- vLLM: v0.21.1rc1.dev339+g1967a5627bc3
Key launch flags
--tensor-parallel-size 4
--max-model-len 500000
--max-num-seqs 8
--block-size 256
--gpu-memory-utilization 0.8
--kv-cache-dtype fp8
--distributed-executor-backend mp
--compilation-config {"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}
--speculative-config {"method":"mtp","num_speculative_tokens":2}
--enable-flashinfer-autotune
--enable-prefix-caching
--enable-chunked-prefill
Environment: NCCL_NET=IB, NCCL_IB_DISABLE=0, NCCL_IB_GID_INDEX=0, VLLM_USE_B12X_MOE=1.
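The two JSON-valued flags need shell quoting if you paste them into a command line. Below is a rough sketch that assembles the flags and environment listed above into one quoted string. It assumes the server is started through the stock `vllm serve` entry point, which the post does not state (the image may use its own launch script), and `MODEL` is a placeholder because the model path is not given. It is not a complete multi-node launch recipe.

```python
import json
import shlex

MODEL = "<model-path>"  # placeholder: the post does not give the model path

# Flags that take a value, copied from the list above.
FLAGS = {
    "--tensor-parallel-size": 4,
    "--max-model-len": 500000,
    "--max-num-seqs": 8,
    "--block-size": 256,
    "--gpu-memory-utilization": 0.8,
    "--kv-cache-dtype": "fp8",
    "--distributed-executor-backend": "mp",
    "--compilation-config": {"cudagraph_mode": "FULL_AND_PIECEWISE", "custom_ops": ["all"]},
    "--speculative-config": {"method": "mtp", "num_speculative_tokens": 2},
}
# Boolean switches, also copied from the list above.
SWITCHES = ["--enable-flashinfer-autotune", "--enable-prefix-caching", "--enable-chunked-prefill"]
ENV = {"NCCL_NET": "IB", "NCCL_IB_DISABLE": "0", "NCCL_IB_GID_INDEX": "0", "VLLM_USE_B12X_MOE": "1"}


def build_command() -> str:
    """Return a shell-safe command string; dict values are serialized as compact JSON."""
    argv = ["vllm", "serve", MODEL]  # assumed entry point
    for name, value in FLAGS.items():
        if isinstance(value, dict):
            value = json.dumps(value, separators=(",", ":"))
        argv += [name, str(value)]
    argv += SWITCHES
    env_prefix = " ".join(f"{key}={shlex.quote(val)}" for key, val in ENV.items())
    return f"{env_prefix} {shlex.join(argv)}"


if __name__ == "__main__":
    print(build_command())
```

`shlex.join` wraps the JSON arguments in single quotes, so the shell does not strip the double quotes or expand the braces.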
Single-stream decode (short prompt, 200 output tokens, server-side)
~70 tok/s sustained with 83% MTP draft acceptance rate (mean acceptance length 2.66).
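The two MTP figures above are consistent with each other. A small sketch of the relationship, assuming the usual definitions (acceptance rate is accepted draft tokens over proposed draft tokens, and acceptance length counts the one token the target model always emits per step):

```python
def mean_acceptance_length(num_speculative_tokens: int, acceptance_rate: float) -> float:
    """Average tokens emitted per target-model step: 1 guaranteed token plus accepted drafts."""
    return 1.0 + num_speculative_tokens * acceptance_rate


def target_steps_per_second(decode_tok_per_s: float, acceptance_length: float) -> float:
    """Rough number of target-model forward steps per second implied by a decode rate."""
    return decode_tok_per_s / acceptance_length


if __name__ == "__main__":
    length = mean_acceptance_length(2, 0.83)  # num_speculative_tokens=2 from the launch flags
    print(f"mean acceptance length: {length:.2f}")
    print(f"target steps/s at 70 tok/s: {target_steps_per_second(70.0, length):.1f}")
```

With 2 speculative tokens and 83% acceptance this gives 1 + 2 × 0.83 = 2.66, matching the reported value. The steps-per-second figure is only a rough estimate and ignores the cost of producing the drafts.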
Prefill throughput (single request, varying context)
| Context | Prompt tokens | Time | Prefill tok/s | TTFT |
|---|---|---|---|---|
| 4K | 705 | 7.05s | 100* | 7.05s |
| 32K | 5,505 | 6.81s | 809* | 6.81s |
| 128K | 22,005 | 11.05s | 1,991 | 11.05s |
| 256K | 44,005 | 13.25s | 3,322 | 13.25s |
| 500K | 85,005 | 21.25s | 4,000 | 21.25s |
*Short contexts are network-latency-dominated from client-side measurement; server-side prefill is faster.
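For anyone re-deriving the table: the Prefill tok/s column is the Prompt tokens column divided by Time, not the nominal Context label divided by Time. A minimal sketch using the rows above (results agree with the table to within rounding of the timings):

```python
# (context label, prompt tokens, seconds) copied from the table above
ROWS = [
    ("4K", 705, 7.05),
    ("32K", 5_505, 6.81),
    ("128K", 22_005, 11.05),
    ("256K", 44_005, 13.25),
    ("500K", 85_005, 21.25),
]


def prefill_rate(prompt_tokens: int, seconds: float) -> float:
    """Client-side prefill throughput in tokens per second."""
    return prompt_tokens / seconds


if __name__ == "__main__":
    for label, tokens, seconds in ROWS:
        print(f"{label:>5}: {prefill_rate(tokens, seconds):8.0f} tok/s")
```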
Decode concurrency (200 output tokens per request, client-side measured)
| C | Total output | Time | Aggregate tok/s | Per-request avg |
|---|---|---|---|---|
| 1 | 200 | 8.70s | 23.0 | 23.0 |
| 2 | 400 | 10.11s | 39.6 | 19.8 |
| 4 | 800 | 12.79s | 62.5 | 15.6 |
| 8 | 1,600 | 13.55s | 118.1 | 14.8 |
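The aggregate and per-request columns follow directly from total output and wall time. The sketch below recomputes them and adds a scaling-efficiency figure relative to C=1; that last number is my own derived metric, not something reported in the post.

```python
# (concurrency, wall-clock seconds) from the table above; 200 output tokens per request
RUNS = [(1, 8.70), (2, 10.11), (4, 12.79), (8, 13.55)]
TOKENS_PER_REQUEST = 200


def decode_stats(concurrency: int, wall_seconds: float) -> tuple[float, float]:
    """Return (aggregate tok/s, per-request tok/s) for one concurrency level."""
    aggregate = concurrency * TOKENS_PER_REQUEST / wall_seconds
    return aggregate, aggregate / concurrency


if __name__ == "__main__":
    baseline, _ = decode_stats(*RUNS[0])
    for concurrency, seconds in RUNS:
        aggregate, per_request = decode_stats(concurrency, seconds)
        efficiency = aggregate / (concurrency * baseline)
        print(f"C={concurrency}: {aggregate:6.1f} tok/s total, "
              f"{per_request:5.1f} per request, {efficiency:.0%} of linear")
```

Because these are client-side timings, each wall time includes the prefill and network latency, which is why C=1 shows about 23 tok/s rather than the ~70 tok/s measured server-side.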
Prefill concurrency (128K context, 5 output tokens, client-side measured)
| C | Total prompt | Time | Aggregate tok/s | Avg TTFT |
|---|---|---|---|---|
| 1 | 22K | 5.35s | 4,111 | 5.35s |
| 2 | 44K | 5.58s | 7,894 | 2.79s |
| 4 | 88K | 5.83s | 15,106 | 1.46s |
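A note on reading the last column: the Avg TTFT values match wall time divided by C. That suggests an amortized per-request figure rather than the latency each individual request saw, but this is inferred from the numbers, not stated in the post. A quick check:

```python
# (concurrency, wall-clock seconds, reported Avg TTFT) from the table above
ROWS = [(1, 5.35, 5.35), (2, 5.58, 2.79), (4, 5.83, 1.46)]


def amortized_seconds(concurrency: int, wall_seconds: float) -> float:
    """Wall time divided evenly across the concurrent requests."""
    return wall_seconds / concurrency


if __name__ == "__main__":
    for concurrency, seconds, reported in ROWS:
        derived = amortized_seconds(concurrency, seconds)
        print(f"C={concurrency}: derived {derived:.2f}s, reported {reported:.2f}s")
```

If that reading is right, each request at C=4 still waited roughly the full 5.83s for its first token.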
KV cache capacity:
5.3M tokens (~10 concurrent 500K requests safely)
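The "~10" follows from dividing the cache capacity by the maximum request length. The sketch below also applies the --max-num-seqs 8 setting from the launch flags, assuming it caps concurrently scheduled sequences as it does in stock vLLM; under that assumption the scheduler limit, not KV memory, is the tighter bound.

```python
def max_full_length_requests(kv_cache_tokens: int, max_model_len: int, max_num_seqs: int) -> int:
    """Concurrent max-length requests that fit, bounded by both KV memory and the scheduler cap."""
    by_memory = kv_cache_tokens // max_model_len
    return min(by_memory, max_num_seqs)


if __name__ == "__main__":
    print("by KV memory alone:", 5_300_000 // 500_000)
    print("with --max-num-seqs 8:", max_full_length_requests(5_300_000, 500_000, 8))
```

Prefix caching can let requests share blocks, so real capacity for overlapping prompts may be higher than this worst-case count.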
Cold start time (first run):
- Model load: ~160s
- DeepGEMM warmup: ~2min
- TileLang + FlashInfer autotune: ~37s
- Total: ~3-4 min (subsequent starts ~40s with cache)
Comparison to known 2-node results:
The 2-node recipe showed ~42 tok/s single-stream decode and ~2000 tok/s prefill (short ctx). On 4 nodes we see:
- Decode: ~70 tok/s (1.66× improvement, expected sub-linear due to MoE all-to-all overhead)
- Prefill 128K: ~1991 tok/s (matched at equal context)
- Prefill 500K: 4000 tok/s (long-context prefills benefit from more TP shards)
- Concurrency scaling is near-linear for prefill up to C=4 (15K aggregate)
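As a quick way to express the decode comparison, here is a sketch that turns the two approximate single-stream figures into a speedup and a parallel efficiency. The efficiency figure is my own derived metric, and both inputs are rounded values from the posts.

```python
def scaling(old_rate: float, new_rate: float, old_nodes: int, new_nodes: int) -> tuple[float, float]:
    """Return (speedup, efficiency), where efficiency is speedup relative to the node-count ratio."""
    speedup = new_rate / old_rate
    efficiency = speedup / (new_nodes / old_nodes)
    return speedup, efficiency


if __name__ == "__main__":
    speedup, efficiency = scaling(42.0, 70.0, 2, 4)
    print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```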
Gotchas encountered:
- GID index on RoCE: NCCL defaulted to GID index 3, which was empty. Fix: NCCL_IB_GID_INDEX=0.
- NCCL_NET=IB required: Without it, pip-distributed NCCL won’t use RoCE, causing ibv_modify_qp failures.
- Missing /workspace: aidendle94’s image doesn’t have the WORKDIR that the launch script expects. Added docker exec mkdir -p /workspace.
- Persistent “GID table changed” WARN: RoCE interface generates netlink events during runtime. Does not affect functionality; suppress with sysctl net.ipv6.conf.roce*.accept_ra=0 if desired.
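For the GID index gotcha, one way to see which indices are actually populated before picking a value for NCCL_IB_GID_INDEX is to read the GID table from sysfs. This is a minimal sketch, assuming the standard Linux RDMA sysfs layout (/sys/class/infiniband/&lt;device&gt;/ports/&lt;port&gt;/gids/&lt;index&gt;) and that unused entries read as all zeros or fail to read; the helper name is made up for illustration.

```python
from pathlib import Path

EMPTY_GID = "0000:0000:0000:0000:0000:0000:0000:0000"


def populated_gid_indices(gids_dir) -> list[int]:
    """Return the sorted GID indices in one 'gids' directory that hold a non-zero GID."""
    indices = []
    for entry in Path(gids_dir).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            gid = entry.read_text().strip()
        except OSError:
            continue  # some kernels return an error for unused entries
        if gid and gid != EMPTY_GID:
            indices.append(int(entry.name))
    return sorted(indices)


if __name__ == "__main__":
    # Prints nothing on machines without RDMA devices.
    for gids in sorted(Path("/sys/class/infiniband").glob("*/ports/*/gids")):
        print(gids, populated_gid_indices(gids))
```

If index 3 is missing from the output while index 0 is present, that matches the situation described above.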
Summary:
aidendle94’s b12x fork on 4× DGX Spark delivers solid performance: ~70 tok/s single-stream decode, 4000 tok/s prefill at 500K context, and near-linear prefill concurrency scaling up to 15K tok/s aggregate with sub-1.5s TTFT. The limiting factor is clearly MoE all-to-all cross-node communication, not compute.
How are you running a 4x cluster? I’m thinking about it down the road.
I’m using a MikroTik CRS812 switch with 400G ports broken out to 2×200G per node. The rest is essentially the same recipe as yours but with 4 nodes, TP=4, and aidendle94’s Docker image instead of building from source. Key fix: NCCL_IB_GID_INDEX=0 was needed for RoCE on the CRS812.
So can you run stuff like GLM? Might need to DM you.
There are multiple PRs for DeepSeek v4 on sm12x. @jasl, do you know when vLLM plans to merge your PR into the main branch? I think you said you contacted the vLLM team a long time ago and they planned to first make DeepSeek v4 work on datacenter GPUs and come back to sm12x later, and it seems vLLM v0.22.0 has largely finished datacenter GPU support.
What is everyone’s working launch command and recipe? Both nodes jump to 93 GB of memory each the second I run the launch command, and by the time it says it has finished loading the weights, each node is at 130 GB, followed by a full system lockup to the point that I have to hard-restart each node. Even lowering the max memory use to 0.5 did not help at all.
I was able to fully replicate a Reddit recipe and the results are outstanding. I will create a new thread to make it more visible; it’s worth it.
Posted here: DeepSeek v4 Flash (Aiden Recipe from Reddit) - 1M token session operational, Cuda 12.1 tailored for DGX Spark GB10
They probably do not have direct support for SM12x, unfortunately.
Tried to get it working for 4 hours and when it finally loaded up all responses are incoherent.
Which driver version are you using? I’ve tried the exact same recipes, but I always run into OOM (Out of Memory) errors. My current driver version is 580.159.03.
How can I upgrade my drivers, and is there a proper/recommended way to do it?
Thanks for sharing your work! I’m currently trying to deploy DeepSeek v4 on my 2-Spark Ray cluster (GB10) following the official documentation. I’ve set up the Ray cluster and attempted deployment using the deepseekv4-arm64-cu130 vLLM image, but encountered an error: “RuntimeError: call, /opt/venv/lib/python3.12/site-packages/torch/include/torch/csrc/stable/stableivalue_conversions.h:544, Not yet supported ScalarType 44, please file an issue describing your use case.”
I noticed that most other deployments mentioned in this thread run without Ray, so I’m unsure whether the vLLM image and Docker branch referenced here would actually work in a Ray cluster environment. Could you share some details on how you deployed DeepSeek v4 Flash on Ray? Any insights or configuration tips would be greatly appreciated. Thanks in advance!