DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers

tonyd615 · May 17, 2026, 3:49pm

Trying this now. serapis, I used your recipe for Minimax 2.7 in the past. Would you say DS4 Flash is “better”? I feel like this model is faster, and it does a better job on the agent side for me. It knows where to look and gets the job done, and it’s not stuck behind a license either.

serapis · May 17, 2026, 6:03pm

I’d argue it is more capable. I personally look forward to them releasing the vision-capable version of the model and hope that we have more improvements in vLLM by then.

I was able to run it with 512K today but it got incredibly slow. @jasl thanks for continuing to improve DSV4 on vLLM for us. Any hunch if any of your work will be merged anytime soon or will we have to continue via workarounds such as building from forks etc?

tonyd615 · May 17, 2026, 6:04pm

sera, so as of right now the best recipe is the one you posted above? I’d like to try it if so.

serapis · May 17, 2026, 6:14pm

I won’t claim it’s the best recipe. I need to run it much longer to understand the quirks. I am sure there is still headroom for optimization.

jasl · May 17, 2026, 6:15pm

I had a short conversation with them.
TL;DR:
I work for myself and people who trust me.
I don’t care what the vLLM team thought.

Unfortunately, I can’t test 512K for now, but I will try to improve prefill performance and my personal use cases. I collected metrics this weekend and will move the work forward next week.

rkiles · May 18, 2026, 3:31pm

Thank you so much for putting this info out here. I’ve been running DeepSeek v4 Flash on my dual node cluster for a few days now and it’s been doing really well. For anyone who’s interested in running MAX thinking mode, you can add the reasoning_effort parameter to the --default-chat-template-kwargs flag as shown below. I probably won’t be using the MAX setting for everything as it does add a considerable amount of time, but it did help with refactoring one of my more complex websites to a Flutter app.

--default-chat-template-kwargs '{{"thinking":true, "preserve_thinking":true, "reasoning_effort":"max"}}'
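For anyone scripting this, here is a minimal Python sketch (not part of the recipe) that builds the JSON value for that flag and shell-quotes it. The kwarg names are the ones quoted in the line above; `build_chat_template_kwargs` and `as_cli_flag` are hypothetical helper names. It emits plain single-brace JSON; the doubled braces in the line above are kept as posted and are assumed to be escaping for a templated recipe file.

```python
import json
import shlex


def build_chat_template_kwargs(reasoning_effort="max"):
    """Return the chat-template kwargs quoted in the post (hypothetical helper)."""
    return {
        "thinking": True,
        "preserve_thinking": True,
        "reasoning_effort": reasoning_effort,
    }


def as_cli_flag(kwargs):
    """Render the dict as a shell-safe --default-chat-template-kwargs argument."""
    # Compact separators keep the JSON on one short line.
    value = json.dumps(kwargs, separators=(",", ":"))
    # shlex.quote wraps the JSON in single quotes so the shell passes it through intact.
    return "--default-chat-template-kwargs " + shlex.quote(value)


if __name__ == "__main__":
    print(as_cli_flag(build_chat_template_kwargs()))
```

Generating the value with `json.dumps` avoids the curly-quote and dash mangling that tends to happen when flags are copied out of a forum post.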

0rand · May 18, 2026, 5:32pm

I am waiting for an NVFP4 quant, if it ever arrives. Everything I have tested in FP8 is insanely slow: memory bandwidth limits, only worsened by CX7 in a cluster config.

p33zy · May 18, 2026, 9:40pm

There’s already an NVFP4 quant from Red Hat, but DeepSeek also released many layers in 4-bit quant, hence the small size. Decode speed doesn’t seem to be that bad at the moment. Prefill, on the other hand...

0rand · May 18, 2026, 9:59pm

I would not waste time with the RedHatAI quant. They are always first and always buggy. Probably useful to quantizers as an example, but not for actual use. I love how Qwen and Mistral provide original-quality NVFP4 quants, as well as Nvidia, of course. The rest is a gamble, imo. Sometimes very good, but most require patches that suffer from version drift and don’t work one week after release, as nobody properly pins the versions.

slackyrabbit · May 19, 2026, 12:39am

This recipe is insane. From 25 t/s (with long warmup) → 37 t/s (instant full speed). Thank you for sharing it.

tonyd615 · May 19, 2026, 1:43am

Are you talking to me? I am so new to this stuff, so knowing I did my first little thing for the community means a lot. I didn’t build the recipes, but my agent and I were able to get it working.

slackyrabbit · May 19, 2026, 3:52am

Yes, I was talking to you. Even if you didn’t build it yourself, you shared the information that it worked.

tonyd615 · May 19, 2026, 3:55am

Thank you sir.

tonyd615 · May 19, 2026, 3:56am

Check out MiMo v2.5, Slacky. Identical speeds in my opinion, but it’s OMNI: MiMo-V2.5-NVFP4 on 2x Spark Cluster - Recipe, findings, fixes, benchmarks

0rand · May 19, 2026, 5:56am

Thanks for sharing. Have you measured how much RAM tokens take? What is a realistic token cache on a twin Spark setup?

P.S. I have two Gigabyte Atom dg10s and haven’t ordered cables yet. I’m still researching whether I need one CX7 or two. They seem to be bus-bound at 125 gbs; theoretically two could speed things up with one for outbound and another for inbound, but can that be configured in practice?

mrDragonFox · May 19, 2026, 8:41am

idk why everyone posts about MiMo in a DS thread - just open a new thread or implement it yourself - this is about DeepSeek V4 Flash, not MiMo - the two models have different use cases

aidendle94 · May 19, 2026, 8:57am

Hello. I have an optimized fork of jasl’s branch. I’m still doing cleanup, but at 200k context I am able to get 700 t/s prefill.

GSM8K at 0.96, and Haystack passing at 128k and 200k.

I will be testing and tweaking further throughout the week and will share with you guys. Planning to take next week off from work to keep optimizing.
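To put the prefill figure in perspective, here is a rough back-of-the-envelope sketch of how long a full-context prompt would take to ingest. It assumes "200k" means 200,000 tokens, that 700 t/s holds as an average across the whole prompt, and that nothing is served from a prefix cache; `prefill_seconds` is a hypothetical helper name, not something from the fork.

```python
def prefill_seconds(prompt_tokens, prefill_tokens_per_second):
    """Estimate prefill wall time from prompt length and average prefill rate."""
    if prefill_tokens_per_second <= 0:
        raise ValueError("prefill rate must be positive")
    return prompt_tokens / prefill_tokens_per_second


if __name__ == "__main__":
    # Assumed inputs: 200,000-token prompt at the 700 t/s quoted above.
    secs = prefill_seconds(200_000, 700)
    print(f"{secs:.0f} s (~{secs / 60:.1f} min)")  # 286 s (~4.8 min)
```

Real runs will likely differ, since prefill throughput usually varies with prompt length and batching.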

tonyd615 · May 19, 2026, 10:24am

I made this thread. I was just letting people know about it as well, because a lot of the people in here are also working to get MiMo working.

paxren2020 · May 19, 2026, 2:13pm

In the model card, Red Hat explicitly stated that the accuracy recovery for this quantization isn’t great. Why use it when there are other alternatives available?

dashtotherock · May 19, 2026, 5:29pm

This is a dumb question, but I could not use --vllm-repo with ./build-and-copy.sh. Could you clarify what I should do to rebuild the Docker image? (I pulled the latest spark-vllm-docker.)
