DeepSeek V4 released

I think people with 4 DGX Sparks are good to go. The rest of us need to wait for a quantized version.

deepseek-ai/DeepSeek-V4-Flash: this one looks like we might be able to run an FP8 quant on a single node… maybe?

Interesting indeed!

DeepSeek made a quantized version.

That’s the one I was referring to. It looks like a smaller model rather than a quantization. That one should run on a dual Spark, but if all of those are active parameters, it’ll be slow… :) I would definitely love an FP8 version of that one though :)

Looks like the architecture is unique… not something to easily toss into vLLM with the current versions, so we’ll have to figure this out. Super excited to see where this goes, my quad GB10 cluster is stoked.

DeepSeek-V4-Flash has 284B total parameters with 13B activated (A13B), so that's a no for a single Spark :(

I was just reading that too

And it’s a mixed FP4+FP8 checkpoint, over 130GB in size. We would need an INT4 quant or something like that to have even a slim chance of running it.
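To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the 284B total parameter count quoted in this thread and 128 GB of unified memory per Spark (my assumption, not something stated above), and it ignores quantization scales, metadata, KV cache and the OS, so real checkpoints will be somewhat larger than these figures.

```python
def weight_size_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal GB, ignoring scales and metadata."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 284e9      # total parameters, as quoted in this thread
SPARK_MEMORY_GB = 128.0   # assumption: unified memory of one DGX Spark

for bits in (8, 4, 3, 2):
    size = weight_size_gb(TOTAL_PARAMS, bits)
    fits = size < SPARK_MEMORY_GB
    print(f"{bits}-bit: ~{size:.0f} GB of weights, under one Spark's memory: {fits}")
```

By this estimate a uniform 4-bit quant is still about 142 GB of weights, so a single Spark would need something closer to 3 bits per weight before the weights alone fit.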

All, this is huge. Make sure to read the technical report at https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf for a glimpse of the innovation behind this series of frontier-level open-weight models.

From an operational point of view, DeepSeek-V4-Flash is the one in the series we will be tinkering with pretty soon, with 284B parameters (13B activated).

The numbers (total parameters) look different from what I see if I click on the individual models.

Other than the architecture being new, my suspicion is that this should be very amenable to an approach like PrismaQuant, set to map the 4-bit and 8-bit layers in DS-4-Flash to NVFP4 and MXFP8. The weight size of the Flash model is 160GB per HF, so it should fit on 2x Sparks with room for generous context (my bet is around 700k, though it would be awesome to hit 1M). That’s a very straightforward way to near-transparent quality and performance on 2x GB10s.
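A quick sketch of that budget, for anyone who wants to plug in their own numbers. The 160 GB weight size is from the post above; the 128 GB per Spark and the 8 GB per-node reserve for the OS and runtime are my assumptions, and the per-token figure is only the budget you would have available, not the model's actual KV cache footprint (which I have not measured).

```python
def free_memory_gb(nodes: int, memory_per_node_gb: float,
                   weights_gb: float, reserve_per_node_gb: float) -> float:
    """Memory left for KV cache and activations after weights and a fixed reserve."""
    total = nodes * memory_per_node_gb
    reserved = nodes * reserve_per_node_gb
    return total - weights_gb - reserved


headroom_gb = free_memory_gb(
    nodes=2,
    memory_per_node_gb=128.0,   # assumption: unified memory per Spark
    weights_gb=160.0,           # weight size per HF, from the post above
    reserve_per_node_gb=8.0,    # hypothetical reserve for OS and runtime
)
print(f"Headroom across both nodes: ~{headroom_gb:.0f} GB")

# Per-token cache budget needed to reach a given context length.
for target_tokens in (700_000, 1_000_000):
    budget_kb = headroom_gb * 1e9 / target_tokens / 1e3
    print(f"{target_tokens:>9,} tokens -> at most ~{budget_kb:.0f} KB of cache per token")
```

Whether 700k or 1M is reachable then comes down to how many bytes per token the new attention design really needs, which is a question for the technical report and real measurements.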

The reason it is listed differently in some places on HF is that it’s directly provided as a mixed-precision quant. I think over half of it is 4-bit.
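As a rough sanity check on the "over half" guess, one can solve for the 4-bit share from the sizes quoted in this thread. This sketch assumes every weight is stored at either exactly 4 or exactly 8 bits and ignores scales and any higher-precision tensors, so treat the result as a ballpark only.

```python
def four_bit_fraction(total_params: float, checkpoint_bytes: float) -> float:
    """Share of weights at 4-bit, assuming the rest are 8-bit and nothing else is stored.

    Solves: checkpoint_bytes = total_params * (0.5 * f + 1.0 * (1 - f)) for f.
    """
    bytes_per_param = checkpoint_bytes / total_params
    return 2.0 * (1.0 - bytes_per_param)


# 284B parameters and ~160 GB, both as quoted in this thread.
fraction = four_bit_fraction(total_params=284e9, checkpoint_bytes=160e9)
print(f"Estimated 4-bit share: {fraction:.0%}")
```

Under those simplifications the share comes out well above half, which is at least consistent with the guess above.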

It will probably be after the weekend before that becomes practical, though, because there is basically no day-0 support from anyone.

Which is really quite notable: they did not partner with the major backends. At all. The inference infrastructure is literally HF Transformers and… that’s it. They must be serving this with something custom internally. Maybe they tossed out PRs concurrent with release? I haven’t checked.

This probably sets us back around a week as vLLM and SGLang get a handle on the new architecture and PRs come in to support it as well as the chat interface, attention model, and thinking mode selection.

But all of that will come, because it looks to be a solid model and - after all that commotion regarding MiniMax M2.7 - still MIT licensed!

We might be lucky: DeepSeek-V4-Flash | vLLM Recipes

I’ll give this a go tomorrow once the model has downloaded.

vLLM published Docker images specifically for this model, which I assume include whatever is needed even if it’s not merged yet:

Totally missed that, thanks!!

Waiting for a kernel update to support 2 and 3 bits. ;-)

Sadly too large for a single DGX Spark, and I don’t wanna get a second one, I just got the first one :(

When you get 2, you want 4. When you get 4, 8 seems tempting. It is a never-ending cycle. Even when you get a B200 unit, you will want more. LLMs are like a drug, you can never get enough hardware. All those Chinese LLM companies are starved for GPUs too.

Maybe it is time to set up LLM Anonymous. We can do a Zoom call.

so true

I JUST picked up the second one. It hurt cuz the price has gone up now, but I have been wanting it for some time. It’s nice cuz now things like these larger models can fit, but it was also so I could consistently self-host a few models. They do feel like testing boxes, so any time something good comes along, it’s like: time to cloud host, or invest further.

So, the only real question is: "How long can you actually hold out before you double down and grab a second Spark?"

Was anyone able to run it? vLLM references a special image build, vllm/vllm-openai:deepseekv4-cu130, which is obviously not well supported on Sparks.