Fully custom CUDA-native Deepseek 4 Flash optimized for 1x Spark! antirez/ds4

entrpi · May 12, 2026, 4:42am

While I was working on my vLLM/CUDA port of antirez/ds4’s newly famous custom hybrid quant of Deepseek 4 Flash which is optimized to work with high accuracy on 128GB M-series Macs, he was busy porting his MLX work to CUDA!

Today I’ve been testing his new CUDA work within his custom ds4 (newly renamed “DwarfStar 4”) inference engine, specifically made just for this model, optimized for 128GB systems. This is a game changer! It is already performant, accurate, and starts incredibly quickly on GB10.

I have created a convenience installer repo with a one-shot installer script:

GitHub - Entrpi/ds4-on-spark: antirez/ds4 (DwarfStar 4) on NVIDIA DGX Spark — install, benchmarks, and roofline analysis. Steady-state decode at ~95% of bandwidth ceiling; MTP and concurrency analyzed. · GitHub

curl -sSL https://raw.githubusercontent.com/Entrpi/ds4-on-spark/main/install.sh | bash -s -- --with-mtp --start

That one command:

Verifies the host (aarch64, GB10/SM121, CUDA 13, ≥110 GiB free disk).
Clones antirez/ds4 and builds ds4, ds4-server, ds4-bench with CUDA_ARCH=sm_121 … takes about 8 seconds on a Spark, no patches required.
Downloads the Q2 GGUF (~81 GiB) and the optional MTP draft GGUF (~3.6 GiB) from antirez/deepseek-v4-gguf.
Runs the “capital of France” smoke test and asserts “Paris” in the output.
Starts ds4-server on :8000 (OpenAI v1-compatible streaming).

Startup is genuinely fast


Build (`make -j20 CUDA_ARCH=sm_121`)	7.9 s
Cold load: 80.76 GiB of tensors → GPU cache	~20 s
Time-to-first-token (cold process, short prompt)	~21 s end-to-end

After cold start the server stays warm: subsequent requests skip the load.

llama-benchy results

Using eugr/llama-benchy, 3 runs per row, --latency-mode generation:

test	t/s	peak t/s	ttfr (ms)
pp2048 (prefill)	364.5 ± 2.6	—	5890
tg32 @ d=0	29.2 ± 1.4	31.0	—
tg128 @ d=0	28.0 ± 1.0	34.0	—
tg512 @ d=0	22.8 ± 2.6	33.3	—
pp2048 @ d=4k	339.5 ± 0.3	—	18712
tg32 @ d=4k	27.8 ± 1.3	29.3	—
tg128 @ d=4k	25.9 ± 0.5	30.0	—
tg512 @ d=4k	23.3 ± 2.2	32.3	—
pp2048 @ d=16k	310.7 ± 0.5	—	61401
tg{32,128,512} @ d=16k	24.1 / 24.5 / 24.2	~30	—

Reproduce on your own Spark once ds4-server is up:

uvx --from git+https://github.com/eugr/llama-benchy llama-benchy \
    --base-url http://127.0.0.1:8000/v1 --model deepseek-v4-flash \
    --pp 2048 --tg 32 128 512 --depth 0 4096 16384 \
    --latency-mode generation

Roofline … how close to the hardware limit is this?

I measured the kernel-accessible memory bandwidth on Spark directly (small CUDA copy bench in the repo) and got ~215–227 GB/s vs the published GB10 LPDDR5X peak of ~273 GB/s (so ~82 % of theoretical, normal for real workloads).

The model’s effective bytes-per-token at decode comes out to ~8 GB (computed from the safetensors index: routed-expert active slice + always-on attention/indexer/embed/head/norms + KV reads at 16k). That gives a strict roofline of 225 / 8 ≈ 28 t/s.

ds4 is sitting at ~95 % of that at steady state. There’s basically no software-side decode speedup left on the table for this quant on this hardware — at least not the easy kind. Going faster would mean a tighter quant (FP4/1.5-bit), batched multi-user serving (amortizes weight reads), or faster hardware.

A note on “13 t/s vs 28 t/s”

Two metrics, same workload, both correct:

ds4’s own log reports avg=12.94 t/s — that’s total_gen_tokens / total_decode_wall_time, including the ~1.0–1.3 s first-token post-prefill setup.
llama-benchy reports tg=24.14 t/s at d=16k — that’s (N − 1) / (t_last − t_first), excluding first-token latency.

The first-token weight matters for interactive feel on short replies; the steady-state rate dominates long-form generation. Pick by use case.

MTP (speculative decode) … not ready yet

ds4 ships --mtp <draft.gguf> --mtp-draft N with a separate 3.6 GiB draft GGUF. It is currently broken, CUDA support hasn’t been fully ported from MLX.

Concurrency: `ds4-server` is single-stream

For multi-user serving heads-up: ds4-server processes requests strictly serially. With --concurrency 2 against llama-benchy, the second request waits for the first to finish (~168 s for a 35k-token prompt) before starting prefill. ds4 is intentionally narrow: single-session by design. If you need many concurrent users on one Spark, this isn’t the runtime for that; vLLM with paged-attention batching is the right tool there, I’ll continue working on optimizing that next.

Quality check

Single-prompt smoke tests come back clean. For example, asking it to write is_prime(n) with the 6k±1 optimization and list primes 100–130:

101, 103, 107, 109, 113, 127

All six correct, on first try, at Q2.

GitHub - Entrpi/ds4-on-spark: antirez/ds4 (DwarfStar 4) on NVIDIA DGX Spark — install, benchmarks, and roofline analysis. Steady-state decode at ~95% of bandwidth ceiling; MTP and concurrency analyzed. · GitHub

Massive thanks to @antirez for shipping this. Going from “Metal-only” to “first-class CUDA on Spark with a build knob” in a few days is wild.

tonyd615 · May 12, 2026, 5:56am

So what could you do with this on 2 sparks ?

entrpi · May 12, 2026, 6:28am

I am currently working on porting MTP support to CUDA.

entrpi · May 12, 2026, 6:29am

Currently a one-Spark lab here, but would be happy to hear test results from dual-Spark users!

tonyd615 · May 12, 2026, 6:39am

I can try it, what are you currently set at on context on this ?

entrpi · May 12, 2026, 6:52am

The model is extremely efficient for KV cache, you can go up to the full 1M tokens, though probably no more than 200K is practical.

tonyd615 · May 12, 2026, 7:18am

but no VLLM ?

entrpi · May 12, 2026, 9:28am

Yes, I’m working on a vLLM port. It does work, but is slow currently. Should be able to optimize to a similar perf level soon.

serapis · May 12, 2026, 12:41pm

I’ve been cobbling together a DS4 TP=2 version based on the CUDA work that was released. Not working yet but getting closer. If anyone is curious and/or has spare time to contribute: https://github.com/SeraphimSerapis/ds4/tree/tensorparallelism

coder543 · May 12, 2026, 3:18pm

I don’t understand why people are going through so much trouble for DS4 Flash when Mimo-V2.5 uses a fraction of the tokens while achieving higher scores, and Mimo-V2.5 is multimodal: text, image, video, and audio input modalities supported. DS4 Flash is text only.

Benchmarks.

Have people tested Mimo-V2.5 and found it to be worse than DS4 Flash? I’ve tested it and it seems quite good. I’ve also been able to fit at least 165k context on a single spark at IQ3_S quant, with the KV cache being full f16. (I think more would fit, but my Spark is crashing and needs to be RMA’d.)

serapis · May 12, 2026, 4:59pm

Making some progress – given we work with a GGUF sharding can be a little complicated. But at least for a first pass, this should be okay.

I went from this:

| model                    |           test |            t/s |      peak t/s |        ttfr (ms) |     est_ppt (ms) |    e2e_ttft (ms) |
|:-------------------------|---------------:|---------------:|--------------:|-----------------:|-----------------:|-----------------:|
| antirez/deepseek-v4-gguf |         pp2048 | 354.70 ± 24.89 |               | 5452.92 ± 302.09 | 5452.12 ± 302.09 | 5753.35 ± 363.11 |
| antirez/deepseek-v4-gguf |          tg128 |   23.95 ± 1.11 |  35.11 ± 6.22 |                  |                  |                  |
| antirez/deepseek-v4-gguf | pp2048 @ d4096 |  330.77 ± 5.87 |               | 17044.51 ± 78.97 | 17013.52 ± 81.37 | 17343.37 ± 15.53 |
| antirez/deepseek-v4-gguf |  tg128 @ d4096 |   24.10 ± 1.11 | 52.50 ± 11.50 |                  |                  |                  |

to this today:

| model                    |           test |            t/s |     peak t/s |         ttfr (ms) |      est_ppt (ms) |     e2e_ttft (ms) |
|:-------------------------|---------------:|---------------:|-------------:|------------------:|------------------:|------------------:|
| antirez/deepseek-v4-gguf |         pp2048 | 358.89 ± 24.36 |              |  5195.30 ± 269.50 |  5194.43 ± 269.50 |  5447.16 ± 222.01 |
| antirez/deepseek-v4-gguf |          tg128 |   28.28 ± 2.76 | 44.67 ± 7.41 |                   |                   |                   |
| antirez/deepseek-v4-gguf | pp2048 @ d4096 |  330.89 ± 3.15 |              | 16971.69 ± 380.77 | 16970.82 ± 380.77 | 17237.54 ± 310.41 |
| antirez/deepseek-v4-gguf |  tg128 @ d4096 |   27.87 ± 1.18 | 45.33 ± 5.56 |                   |                   |                   |

For now I am not sure if this approach will make sense for two clusters or more given that technically we could fit the whole model in vLLM with two clusters if more of the upstream changes hit. I’ll keep on tweaking for a bit to see what’s possible.

simon306 · May 12, 2026, 6:29pm

I’d like to try MiMo-V2.5 if you could share your setup. I have kept an eye on this thread but it seems quite involved: MiMo-V2.5 (New model) - #21 by pfnguyen

pfnguyen · May 12, 2026, 6:43pm

because until very recently, mimo has been incredibly buggy and difficult to run.

coder543 · May 12, 2026, 6:49pm

It’s been easy to run for several days now using native llama.cpp support, with ggufs from Unsloth and AesSedai.

Running DS4-Flash requires a custom inference engine that appears to be a hard fork of llama.cpp, which seems far more difficult than simply downloading a gguf for the regular llama.cpp. So, yes, both models have been difficult… but one is significantly easier than the other now?

This is the gist of the command I’ve been using:

llama-server \
    -m /path/to/mimo-v2.5-ud-iq3_s.gguf \
    -fit off \
    -ngl 999 \
    -c 165000 \
    -ub 2048 \
    --parallel 1 \
    --cache-ram 0 \
    --ctx-checkpoints 1 \
    --temp 1.0 \
    --top-p 0.95 \
    --no-warmup \
    --no-mmap \
    --jinja

Image input was added to llama.cpp this morning, and I have not had time to experiment with that.

pfnguyen · May 12, 2026, 8:07pm

llama.cpp support via gguf is great, but not useful for clusters, yet (significantly [30-50%] slower when I tried it in the past using the then-new rdma support for rpc-server)

coder543 · May 12, 2026, 8:26pm

But the antirez engine for DS4 Flash that we’re talking about is not any better in that regard, is it?

I’m not saying either one is perfect, I just don’t understand when the criticisms at both seem the same. Mimo-V2.5 at least uses fewer tokens and supports multimodal. That seems like it would be worth more effort.

But, idk.

I hope a future iteration of DS4 Flash will be more token efficient and multimodal than the current preview.

djordjestojanovic1992 · May 12, 2026, 11:06pm

Why do people even wanna run Q2 or Q3 of a model? wouldn’t a fp8 qwen 3.6 35B be much better in terms of quality and of course 4-5x faster?

coder543 · May 12, 2026, 11:15pm

No, I’m quite sure the Q3 of either of these models would be better than the Q8 of Qwen3.6 35B.

I would love to see more benchmarks of quantized models, of course, but the benchmarks I’ve seen show that quantization is fine. All else equal, Q8 is better, but all else is not equal.

entrpi · May 13, 2026, 2:21am

I have MTP working and marginally faster than no-MTP now. Working on optimizations.

azampatti · May 13, 2026, 4:07am

I would love if you can test it with tool-eval-bench --perf-only --context-pressure 0.6 to see how thr MTP behaves with a heavier context.

thanks!

Topic		Replies	Views
DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark — TP=2, MTP, 200K ctx, recipe + numbers DGX Spark / GB10 deepseek	140	8063	June 2, 2026
Deepseek V4 released DGX Spark / GB10 deepseek	143	14819	May 18, 2026
DeepSeekV4-Flash hybrid quant, 1x DGX Spark: antirez's optimized 128 GB MLX recipe ported to vLLM for GB10 DGX Spark / GB10 Projects deepseek	18	1567	May 11, 2026
Deepseek v4 Flash on 2 Nodes DGX Spark / GB10 Projects deepseek	47	4031	May 31, 2026
DeepSeek v4 Flash (IQ2XXS) on a single GB10! DGX Spark / GB10 Projects llm , llama , deepseek	5	2579	May 30, 2026
Qwen3.5 27B optimisation thread starting at 30+ t/s TP=1 DGX Spark / GB10 llama , agentic-ai	23	2595	May 11, 2026
50%+ Improvement on spark?! DGX Spark / GB10 cuda , deepseek	26	2261	April 7, 2026
MiMo-V2.5-NVFP4 on 2x Spark Cluster - Recipe, findings, fixes, benchmarks DGX Spark / GB10	38	2173	June 2, 2026
Anyone having luck with Deepseek V4 Flash on Dual Sparks? DGX Spark / GB10 deepseek	11	926	May 23, 2026
We unlocked NVFP4 on the DGX Spark: 20% faster than AWQ! DGX Spark / GB10	145	8218	March 28, 2026