Trying this now, and serapis, I used your recipe for Minimax 2.7 in the past. Would you say DS4 Flash is “better”? I feel like this model is faster and it does a better job on the agent side for me. It knows where to look and gets the job done. Also, it’s not stuck behind a license either.
DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers
I’d argue it is more capable. I personally look forward to them releasing the vision-capable version of the model and hope that we have more improvements in vLLM by then.
I was able to run it with 512K today but it got incredibly slow. @jasl thanks for continuing to improve DSV4 on vLLM for us. Any hunch if any of your work will be merged anytime soon or will we have to continue via workarounds such as building from forks etc?
sera, so as of right now the best recipe is the one you posted above? I’d like to try it if so.
I won’t claim it’s the best recipe. I need to run it much longer to understand the quirks. I am sure there is still headroom for optimization.
I had a short conversation with them.
TL;DR:
I work for myself and people who trust me.
I don’t care what the vLLM team thought.
Unfortunately, I can’t test 512K for now, but I will try to improve prefill performance and my personal use cases. I collected metrics this weekend and will move the work forward next week.
Thank you so much for putting this info out here. I’ve been running DeepSeek v4 Flash on my dual node cluster for a few days now and it’s been doing really well. For anyone who’s interested in running MAX thinking mode, you can add the reasoning_effort parameter to the --default-chat-template-kwargs flag as shown below. I probably won’t be using the MAX setting for everything as it does add a considerable amount of time, but it did help with refactoring one of my more complex websites to a Flutter app.
--default-chat-template-kwargs '{{"thinking":true, "preserve_thinking":true, "reasoning_effort":"max"}}'
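If pasting that flag from the forum breaks on the quotes or dashes, here is a minimal Python sketch that regenerates it with plain ASCII quoting. The three key names are copied from the post above and are not verified here. The helper name and its `escape_braces` option are hypothetical; the doubled braces in the post may be there because the value sits inside a recipe template that escapes literal braces, which is an assumption on my part.

```python
import json
import shlex


def chat_template_kwargs_flag(reasoning_effort: str = "max",
                              escape_braces: bool = False) -> str:
    """Return the --default-chat-template-kwargs argument as a shell-safe string."""
    # Key names are taken from the post above; they are not checked against the model.
    kwargs = {
        "thinking": True,
        "preserve_thinking": True,
        "reasoning_effort": reasoning_effort,
    }
    # Compact JSON, so Python's True becomes the JSON literal true.
    payload = json.dumps(kwargs, separators=(",", ":"))
    if escape_braces:
        # Assumption: the recipe file treats single braces as placeholders,
        # so literal braces have to be doubled there.
        payload = payload.replace("{", "{{").replace("}", "}}")
    # shlex.quote wraps the JSON in single quotes for a POSIX shell.
    return "--default-chat-template-kwargs " + shlex.quote(payload)


if __name__ == "__main__":
    print(chat_template_kwargs_flag())                    # for a direct command line
    print(chat_template_kwargs_flag(escape_braces=True))  # for a brace-escaping template
```

Swapping `reasoning_effort` for another value only changes the last key; which values the chat template accepts is not something this sketch knows.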
I am waiting for an NVFP4 quant, if it ever arrives. Anything I tested running in FP8 is insanely slow. Memory bandwidth limits, only worsened by CX7 in a cluster config.
There’s already an NVFP4 quant from Red Hat, but DeepSeek also released many layers in 4-bit quant, hence the small size. Decode speed doesn’t seem to be that bad at the moment. Prefill, on the other hand...
I would not waste time with the RedHatAI quant. They are always first and always buggy. Probably useful to quantizers as an example, but not for actual use. I love how Qwen and Mistral provide original-quality NVFP4 quants, as well as Nvidia, of course. The rest is a gamble, imo. Sometimes very good, but mostly they require patches that suffer from version drift and don’t work one week after release, as nobody properly pins the versions.
This recipe is insane. From 25 t/s (with long warmup) → 37 t/s (instant full speed). Thank you for sharing it.
Are you talking to me? I am so new to this stuff, so knowing I did my first little thing for the community means a lot. I didn’t build the recipes, but my agent and I were able to get it working.
Yes, I was talking to you. Even if you didn’t build it yourself, you shared the information that it worked.
Thank you sir.
Check out MiMo v2.5, Slacky. Identical speeds in my opinion, but it’s OMNI: MiMo-V2.5-NVFP4 on 2x Spark Cluster - Recipe, findings, fixes, benchmarks
Thanks for sharing. Have you measured how much RAM tokens take? What is a realistic token cache size on a twin Spark setup?
P.S. I have two Gigabyte Atom dg10s and haven’t ordered cables yet. Still researching whether I need one CX7 or two. They seem to be bus bound at 125 gbs. Theoretically two could speed things up if one is used for outbound and the other for inbound, but can that be configured in practice?
idk why everyone posts about MiMo in a DS thread - just open a new thread or implement it yourself - this is about DeepSeek V4 Flash, not MiMo - the two models have different use cases
Hello. I have an optimized fork of jasl’s branch. I’m still doing cleanup, but at 200K context I am able to get 700 t/s prefill.
GSM8K is at 0.96, and Haystack is passing at 128K and 200K.
I will be testing and tweaking further throughout the week and will share with you guys. Planning to take next week off from work to keep optimizing.
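For a rough sense of what 700 t/s prefill means in wall-clock time for a full 200K prompt, here is a back-of-the-envelope sketch. It assumes the rate stays constant across the whole prompt, which is probably optimistic given the 512K slowdown reported earlier in the thread. The function name is hypothetical.

```python
def prefill_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Estimate prefill time, assuming a constant tokens-per-second rate."""
    if prefill_tps <= 0:
        raise ValueError("prefill_tps must be positive")
    return prompt_tokens / prefill_tps


if __name__ == "__main__":
    seconds = prefill_seconds(200_000, 700.0)
    # Roughly 286 seconds, i.e. a bit under five minutes before the first token.
    print(round(seconds / 60, 1))  # prints 4.8
```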
I made this thread. I was just letting people know about it as well, because a lot of the people in here are also working to get MiMo working.
This is a dumb question, but I could not use --vllm-repo with ./build-and-copy.sh. Could you clarify what I should do to rebuild the Docker image? (I pulled the latest spark-vllm-docker.)
