Fully custom CUDA-native Deepseek 4 Flash optimized for 1x Spark! antirez/ds4

While I was working on my vLLM/CUDA port of antirez/ds4’s newly famous custom hybrid quant of Deepseek 4 Flash which is optimized to work with high accuracy on 128GB M-series Macs, he was busy porting his MLX work to CUDA!

Today I’ve been testing his new CUDA work within his custom ds4 (newly renamed “DwarfStar 4”) inference engine, specifically made just for this model, optimized for 128GB systems. This is a game changer! It is already performant, accurate, and starts incredibly quickly on GB10.

I have created a convenience installer repo with a one-shot installer script:

GitHub - Entrpi/ds4-on-spark: antirez/ds4 (DwarfStar 4) on NVIDIA DGX Spark — install, benchmarks, and roofline analysis. Steady-state decode at ~95% of bandwidth ceiling; MTP and concurrency analyzed. · GitHub

curl -sSL https://raw.githubusercontent.com/Entrpi/ds4-on-spark/main/install.sh | bash -s -- --with-mtp --start

That one command:

  1. Verifies the host (aarch64, GB10/SM121, CUDA 13, ≥110 GiB free disk).
  2. Clones antirez/ds4 and builds ds4, ds4-server, ds4-bench with CUDA_ARCH=sm_121 … takes about 8 seconds on a Spark, no patches required.
  3. Downloads the Q2 GGUF (~81 GiB) and the optional MTP draft GGUF (~3.6 GiB) from antirez/deepseek-v4-gguf.
  4. Runs the “capital of France” smoke test and asserts “Paris” in the output.
  5. Starts ds4-server on :8000 (OpenAI v1-compatible streaming).

Startup is genuinely fast

Build (make -j20 CUDA_ARCH=sm_121) 7.9 s
Cold load: 80.76 GiB of tensors → GPU cache ~20 s
Time-to-first-token (cold process, short prompt) ~21 s end-to-end

After cold start the server stays warm: subsequent requests skip the load.

llama-benchy results

Using eugr/llama-benchy, 3 runs per row, --latency-mode generation:

test t/s peak t/s ttfr (ms)
pp2048 (prefill) 364.5 ± 2.6 5890
tg32 @ d=0 29.2 ± 1.4 31.0
tg128 @ d=0 28.0 ± 1.0 34.0
tg512 @ d=0 22.8 ± 2.6 33.3
pp2048 @ d=4k 339.5 ± 0.3 18712
tg32 @ d=4k 27.8 ± 1.3 29.3
tg128 @ d=4k 25.9 ± 0.5 30.0
tg512 @ d=4k 23.3 ± 2.2 32.3
pp2048 @ d=16k 310.7 ± 0.5 61401
tg{32,128,512} @ d=16k 24.1 / 24.5 / 24.2 ~30

Reproduce on your own Spark once ds4-server is up:

uvx --from git+https://github.com/eugr/llama-benchy llama-benchy \
    --base-url http://127.0.0.1:8000/v1 --model deepseek-v4-flash \
    --pp 2048 --tg 32 128 512 --depth 0 4096 16384 \
    --latency-mode generation

Roofline … how close to the hardware limit is this?

I measured the kernel-accessible memory bandwidth on Spark directly (small CUDA copy bench in the repo) and got ~215–227 GB/s vs the published GB10 LPDDR5X peak of ~273 GB/s (so ~82 % of theoretical, normal for real workloads).

The model’s effective bytes-per-token at decode comes out to ~8 GB (computed from the safetensors index: routed-expert active slice + always-on attention/indexer/embed/head/norms + KV reads at 16k). That gives a strict roofline of 225 / 8 ≈ 28 t/s.

ds4 is sitting at ~95 % of that at steady state. There’s basically no software-side decode speedup left on the table for this quant on this hardware — at least not the easy kind. Going faster would mean a tighter quant (FP4/1.5-bit), batched multi-user serving (amortizes weight reads), or faster hardware.

A note on “13 t/s vs 28 t/s”

Two metrics, same workload, both correct:

  • ds4’s own log reports avg=12.94 t/s — that’s total_gen_tokens / total_decode_wall_time, including the ~1.0–1.3 s first-token post-prefill setup.
  • llama-benchy reports tg=24.14 t/s at d=16k — that’s (N − 1) / (t_last − t_first), excluding first-token latency.

The first-token weight matters for interactive feel on short replies; the steady-state rate dominates long-form generation. Pick by use case.

MTP (speculative decode) … not ready yet

ds4 ships --mtp <draft.gguf> --mtp-draft N with a separate 3.6 GiB draft GGUF. It is currently broken, CUDA support hasn’t been fully ported from MLX.

Concurrency: ds4-server is single-stream

For multi-user serving heads-up: ds4-server processes requests strictly serially. With --concurrency 2 against llama-benchy, the second request waits for the first to finish (~168 s for a 35k-token prompt) before starting prefill. ds4 is intentionally narrow: single-session by design. If you need many concurrent users on one Spark, this isn’t the runtime for that; vLLM with paged-attention batching is the right tool there, I’ll continue working on optimizing that next.

Quality check

Single-prompt smoke tests come back clean. For example, asking it to write is_prime(n) with the 6k±1 optimization and list primes 100–130:

101, 103, 107, 109, 113, 127

All six correct, on first try, at Q2.

GitHub - Entrpi/ds4-on-spark: antirez/ds4 (DwarfStar 4) on NVIDIA DGX Spark — install, benchmarks, and roofline analysis. Steady-state decode at ~95% of bandwidth ceiling; MTP and concurrency analyzed. · GitHub

Massive thanks to @antirez for shipping this. Going from “Metal-only” to “first-class CUDA on Spark with a build knob” in a few days is wild.

So what could you do with this on 2 sparks ?

I am currently working on porting MTP support to CUDA.

Currently a one-Spark lab here, but would be happy to hear test results from dual-Spark users!

I can try it, what are you currently set at on context on this ?

The model is extremely efficient for KV cache, you can go up to the full 1M tokens, though probably no more than 200K is practical.

but no VLLM ?

Yes, I’m working on a vLLM port. It does work, but is slow currently. Should be able to optimize to a similar perf level soon.

I’ve been cobbling together a DS4 TP=2 version based on the CUDA work that was released. Not working yet but getting closer. If anyone is curious and/or has spare time to contribute: https://github.com/SeraphimSerapis/ds4/tree/tensorparallelism

I don’t understand why people are going through so much trouble for DS4 Flash when Mimo-V2.5 uses a fraction of the tokens while achieving higher scores, and Mimo-V2.5 is multimodal: text, image, video, and audio input modalities supported. DS4 Flash is text only.

Benchmarks.

Have people tested Mimo-V2.5 and found it to be worse than DS4 Flash? I’ve tested it and it seems quite good. I’ve also been able to fit at least 165k context on a single spark at IQ3_S quant, with the KV cache being full f16. (I think more would fit, but my Spark is crashing and needs to be RMA’d.)

Making some progress – given we work with a GGUF sharding can be a little complicated. But at least for a first pass, this should be okay.

I went from this:

| model                    |           test |            t/s |      peak t/s |        ttfr (ms) |     est_ppt (ms) |    e2e_ttft (ms) |
|:-------------------------|---------------:|---------------:|--------------:|-----------------:|-----------------:|-----------------:|
| antirez/deepseek-v4-gguf |         pp2048 | 354.70 ± 24.89 |               | 5452.92 ± 302.09 | 5452.12 ± 302.09 | 5753.35 ± 363.11 |
| antirez/deepseek-v4-gguf |          tg128 |   23.95 ± 1.11 |  35.11 ± 6.22 |                  |                  |                  |
| antirez/deepseek-v4-gguf | pp2048 @ d4096 |  330.77 ± 5.87 |               | 17044.51 ± 78.97 | 17013.52 ± 81.37 | 17343.37 ± 15.53 |
| antirez/deepseek-v4-gguf |  tg128 @ d4096 |   24.10 ± 1.11 | 52.50 ± 11.50 |                  |                  |                  |

to this today:

| model                    |           test |            t/s |     peak t/s |         ttfr (ms) |      est_ppt (ms) |     e2e_ttft (ms) |
|:-------------------------|---------------:|---------------:|-------------:|------------------:|------------------:|------------------:|
| antirez/deepseek-v4-gguf |         pp2048 | 358.89 ± 24.36 |              |  5195.30 ± 269.50 |  5194.43 ± 269.50 |  5447.16 ± 222.01 |
| antirez/deepseek-v4-gguf |          tg128 |   28.28 ± 2.76 | 44.67 ± 7.41 |                   |                   |                   |
| antirez/deepseek-v4-gguf | pp2048 @ d4096 |  330.89 ± 3.15 |              | 16971.69 ± 380.77 | 16970.82 ± 380.77 | 17237.54 ± 310.41 |
| antirez/deepseek-v4-gguf |  tg128 @ d4096 |   27.87 ± 1.18 | 45.33 ± 5.56 |                   |                   |                   |

For now I am not sure if this approach will make sense for two clusters or more given that technically we could fit the whole model in vLLM with two clusters if more of the upstream changes hit. I’ll keep on tweaking for a bit to see what’s possible.

I’d like to try MiMo-V2.5 if you could share your setup. I have kept an eye on this thread but it seems quite involved: MiMo-V2.5 (New model) - #21 by pfnguyen

because until very recently, mimo has been incredibly buggy and difficult to run.

It’s been easy to run for several days now using native llama.cpp support, with ggufs from Unsloth and AesSedai.

Running DS4-Flash requires a custom inference engine that appears to be a hard fork of llama.cpp, which seems far more difficult than simply downloading a gguf for the regular llama.cpp. So, yes, both models have been difficult… but one is significantly easier than the other now?

This is the gist of the command I’ve been using:

llama-server \
    -m /path/to/mimo-v2.5-ud-iq3_s.gguf \
    -fit off \
    -ngl 999 \
    -c 165000 \
    -ub 2048 \
    --parallel 1 \
    --cache-ram 0 \
    --ctx-checkpoints 1 \
    --temp 1.0 \
    --top-p 0.95 \
    --no-warmup \
    --no-mmap \
    --jinja

Image input was added to llama.cpp this morning, and I have not had time to experiment with that.

llama.cpp support via gguf is great, but not useful for clusters, yet (significantly [30-50%] slower when I tried it in the past using the then-new rdma support for rpc-server)

But the antirez engine for DS4 Flash that we’re talking about is not any better in that regard, is it?

I’m not saying either one is perfect, I just don’t understand when the criticisms at both seem the same. Mimo-V2.5 at least uses fewer tokens and supports multimodal. That seems like it would be worth more effort.

But, idk.

I hope a future iteration of DS4 Flash will be more token efficient and multimodal than the current preview.

Why do people even wanna run Q2 or Q3 of a model? wouldn’t a fp8 qwen 3.6 35B be much better in terms of quality and of course 4-5x faster?

No, I’m quite sure the Q3 of either of these models would be better than the Q8 of Qwen3.6 35B.

I would love to see more benchmarks of quantized models, of course, but the benchmarks I’ve seen show that quantization is fine. All else equal, Q8 is better, but all else is not equal.

I have MTP working and marginally faster than no-MTP now. Working on optimizations.

I would love if you can test it with tool-eval-bench --perf-only --context-pressure 0.6 to see how thr MTP behaves with a heavier context.

thanks!