DeepSeek v4 Flash (IQ2XXS) on a single GB10!

I managed to get DeepSeek v4 Flash (IQ2XXS quantized) running on a single GB10!

(or rather, I managed to get an LLM to hack on it until it worked… can’t really take too much credit myself here.)

So far generation has gone from not running at all to ~4 t/s, then ~8 t/s, and now ~15 t/s (at short context lengths).
Prefill is also slow at ~80 t/s… but hey, it runs!
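To put those rates in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the ~80 t/s prefill and ~15 t/s generation figures hold constant, which is optimistic since they were measured at short context lengths; the function name and the example token counts are made up for illustration.

```python
def estimate_seconds(prompt_tokens, output_tokens, prefill_tps=80.0, gen_tps=15.0):
    """Return (prefill_seconds, generation_seconds) at constant throughput.

    Defaults are the approximate rates reported above; real throughput
    will likely drop as the context grows.
    """
    prefill_s = prompt_tokens / prefill_tps
    gen_s = output_tokens / gen_tps
    return prefill_s, gen_s


# Hypothetical request: an 8000-token prompt with a 300-token reply.
prefill_s, gen_s = estimate_seconds(prompt_tokens=8000, output_tokens=300)
print(f"prefill: {prefill_s:.0f} s, generation: {gen_s:.0f} s")
```

For that example the prompt processing dominates: roughly 100 s of prefill against 20 s of generation, which is why the prefill speed matters so much for agent-style use with long prompts.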

It’s still holding one CPU core at 100% and the GPU (according to nvidia-smi) at ~90% most of the time, so I expect to get it running faster in the coming days (and at this point I can just let it work on itself, hah).

Currently running it with these arguments:
`llama-server -hf antirez/deepseek-v4-gguf -c 524288 -np 1 --cont-batching -ngl all -fa 1 --batch-size 4096 --direct-io --no-mmap --jinja --fit off -t 10`
though I have had it OOM once so some tweaking is still needed.

Update: oops, the version I pushed crashed on an assert. That’s fixed now.

I think the more interesting route is the 2-bit quant format used in antirez/ds4 (DeepSeek 4 Flash local inference engine for Metal) on GitHub.

Quote:

This implementation only works with the DeepSeek V4 Flash GGUFs published for this project. It is not a general GGUF loader, and arbitrary DeepSeek/GGUF files will not have the tensor layout, quantization mix, metadata, or optional MTP state expected by the engine. The 2 bit quantizations provided here are not a joke: they behave well, work under coding agents, call tools in a reliable way. The 2 bit quants use a very asymmetrical quantization: only the routed MoE experts are quantized, up/gate at IQ2_XXS, down at Q2_K. They are the majority of all the model space: the other components (shared experts, projections, routing) are left untouched to guarantee quality.

That’s actually the quant this is made to work with.
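For a feel of what that asymmetric mix means in size terms, here is a small Python sketch that averages bits per weight over one routed expert. It assumes the usual llama.cpp block layouts as I understand them (IQ2_XXS at 66 bytes per 256 weights = 2.0625 bpw, Q2_K at 84 bytes per 256 weights = 2.625 bpw) and that the up, gate, and down matrices hold the same number of weights. The `hidden` and `inter` dimensions are placeholders, not the real model's.

```python
# Approximate bits per weight, assumed from llama.cpp block sizes.
BITS_PER_WEIGHT = {"IQ2_XXS": 2.0625, "Q2_K": 2.625}


def average_bpw(tensors):
    """tensors: list of (weight_count, quant_name) pairs."""
    total_weights = sum(count for count, _ in tensors)
    total_bits = sum(count * BITS_PER_WEIGHT[quant] for count, quant in tensors)
    return total_bits / total_weights


# Hypothetical expert shape, for illustration only.
hidden, inter = 4096, 2048
n = hidden * inter

expert = [
    (n, "IQ2_XXS"),  # up projection
    (n, "IQ2_XXS"),  # gate projection
    (n, "Q2_K"),     # down projection
]
print(average_bpw(expert))  # 2.25
```

Under those assumptions the routed experts land at about 2.25 bits per weight. The whole-file average will be higher because the shared experts, projections, and routing are left unquantized.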