While I was working on my vLLM/CUDA port of antirez/ds4’s newly famous custom hybrid quant of Deepseek 4 Flash which is optimized to work with high accuracy on 128GB M-series Macs, he was busy porting his MLX work to CUDA!
Today I’ve been testing his new CUDA work within his custom ds4 (newly renamed “DwarfStar 4”) inference engine, specifically made just for this model, optimized for 128GB systems. This is a game changer! It is already performant, accurate, and starts incredibly quickly on GB10.
I have created a convenience installer repo with a one-shot installer script:
curl -sSL https://raw.githubusercontent.com/Entrpi/ds4-on-spark/main/install.sh | bash -s -- --with-mtp --start
That one command:
- Verifies the host (aarch64, GB10/SM121, CUDA 13, ≥110 GiB free disk).
- Clones
antirez/ds4and buildsds4,ds4-server,ds4-benchwithCUDA_ARCH=sm_121… takes about 8 seconds on a Spark, no patches required. - Downloads the Q2 GGUF (~81 GiB) and the optional MTP draft GGUF (~3.6 GiB) from
antirez/deepseek-v4-gguf. - Runs the “capital of France” smoke test and asserts “Paris” in the output.
- Starts
ds4-serveron:8000(OpenAI v1-compatible streaming).
Startup is genuinely fast
Build (make -j20 CUDA_ARCH=sm_121) |
7.9 s |
| Cold load: 80.76 GiB of tensors → GPU cache | ~20 s |
| Time-to-first-token (cold process, short prompt) | ~21 s end-to-end |
After cold start the server stays warm: subsequent requests skip the load.
llama-benchy results
Using eugr/llama-benchy, 3 runs per row, --latency-mode generation:
| test | t/s | peak t/s | ttfr (ms) |
|---|---|---|---|
| pp2048 (prefill) | 364.5 ± 2.6 | — | 5890 |
| tg32 @ d=0 | 29.2 ± 1.4 | 31.0 | — |
| tg128 @ d=0 | 28.0 ± 1.0 | 34.0 | — |
| tg512 @ d=0 | 22.8 ± 2.6 | 33.3 | — |
| pp2048 @ d=4k | 339.5 ± 0.3 | — | 18712 |
| tg32 @ d=4k | 27.8 ± 1.3 | 29.3 | — |
| tg128 @ d=4k | 25.9 ± 0.5 | 30.0 | — |
| tg512 @ d=4k | 23.3 ± 2.2 | 32.3 | — |
| pp2048 @ d=16k | 310.7 ± 0.5 | — | 61401 |
| tg{32,128,512} @ d=16k | 24.1 / 24.5 / 24.2 | ~30 | — |
Reproduce on your own Spark once ds4-server is up:
uvx --from git+https://github.com/eugr/llama-benchy llama-benchy \
--base-url http://127.0.0.1:8000/v1 --model deepseek-v4-flash \
--pp 2048 --tg 32 128 512 --depth 0 4096 16384 \
--latency-mode generation
Roofline … how close to the hardware limit is this?
I measured the kernel-accessible memory bandwidth on Spark directly (small CUDA copy bench in the repo) and got ~215–227 GB/s vs the published GB10 LPDDR5X peak of ~273 GB/s (so ~82 % of theoretical, normal for real workloads).
The model’s effective bytes-per-token at decode comes out to ~8 GB (computed from the safetensors index: routed-expert active slice + always-on attention/indexer/embed/head/norms + KV reads at 16k). That gives a strict roofline of 225 / 8 ≈ 28 t/s.
ds4 is sitting at ~95 % of that at steady state. There’s basically no software-side decode speedup left on the table for this quant on this hardware — at least not the easy kind. Going faster would mean a tighter quant (FP4/1.5-bit), batched multi-user serving (amortizes weight reads), or faster hardware.
A note on “13 t/s vs 28 t/s”
Two metrics, same workload, both correct:
- ds4’s own log reports
avg=12.94 t/s— that’stotal_gen_tokens / total_decode_wall_time, including the ~1.0–1.3 s first-token post-prefill setup. - llama-benchy reports
tg=24.14 t/sat d=16k — that’s(N − 1) / (t_last − t_first), excluding first-token latency.
The first-token weight matters for interactive feel on short replies; the steady-state rate dominates long-form generation. Pick by use case.
MTP (speculative decode) … not ready yet
ds4 ships --mtp <draft.gguf> --mtp-draft N with a separate 3.6 GiB draft GGUF. It is currently broken, CUDA support hasn’t been fully ported from MLX.
Concurrency: ds4-server is single-stream
For multi-user serving heads-up: ds4-server processes requests strictly serially. With --concurrency 2 against llama-benchy, the second request waits for the first to finish (~168 s for a 35k-token prompt) before starting prefill. ds4 is intentionally narrow: single-session by design. If you need many concurrent users on one Spark, this isn’t the runtime for that; vLLM with paged-attention batching is the right tool there, I’ll continue working on optimizing that next.
Quality check
Single-prompt smoke tests come back clean. For example, asking it to write is_prime(n) with the 6k±1 optimization and list primes 100–130:
101, 103, 107, 109, 113, 127
All six correct, on first try, at Q2.
Massive thanks to @antirez for shipping this. Going from “Metal-only” to “first-class CUDA on Spark with a build knob” in a few days is wild.
