Fitting a high-quality REAP-less GLM-5.2 onto 4x DGX Spark

I got GLM-5.2 NVFP4 running on four DGX Sparks at 128K context (or, optionally, at 32K context so you can run DCP=1 and get >22 tps).

Objective: a high-quality 4-bit quant running on 4x Spark. Model: Mapika/GLM-5.2-NVFP4 (Hugging Face)

TL;DR: 128K context with fp8_ds_mla KV, ~15-16 tps decode at c0, falling to ~13 tps decode at long context (this holds up really well)

Hardware: 4x standard NVIDIA-brand GB10 DGX Sparks, and a MikroTik RoCE switch.

To quote the card:

The MoE expert FFNs (routed + shared) are quantized to NVFP4; attention (MLA + the DeepSeek-style DSA lightning indexer), the router, and the LM head are kept in BF16. This shrinks the checkpoint from 1.5 TB → 410 GB (~3.7×) while retaining GSM8K accuracy within ~2 points of BF16.

Why this is interesting: the model is too large and the memory is too tight to treat Spark like normal discrete-GPU hardware. The win was combining decode-context parallelism with aggressive system/Ray memory trimming. DCP4 shards the decode context across the four TP ranks, which is what makes 128K feasible. MTP1 then recovers enough generation speed to be usable.
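The DCP4 claim can be sanity-checked against the capacity figures quoted further down in this post. A minimal sketch, assuming the simple model that each of the DCP ranks holds 1/DCP of the context inside the same per-rank KV budget (the function name is made up for illustration; the three constants are the numbers from the post):

```python
# Figures quoted in this post; everything else is derived arithmetic.
KV_BYTES_PER_RANK = 1_810_000_000  # explicit KV allocation per rank
TOKENS_DCP1 = 33_024               # fitted KV tokens in the 32K / DCP=1 notes
TOKENS_DCP4 = 132_096              # fitted KV tokens in the 128K / DCP4 result


def fitted_tokens(tokens_per_rank: int, dcp: int) -> int:
    """Assumed model: context is split evenly, so capacity scales with DCP."""
    return tokens_per_rank * dcp


# Same per-rank budget, four times the context once it is sharded 4 ways.
print(fitted_tokens(TOKENS_DCP1, 4), TOKENS_DCP4)

# Implied KV footprint per token per rank (derived, not measured).
bytes_per_token = KV_BYTES_PER_RANK / TOKENS_DCP1
print(f"~{bytes_per_token / 1024:.1f} KiB per token per rank")
```

Under that assumption the two quoted capacities line up exactly, which is the whole reason 131,072 tokens fits at DCP4 and not at DCP=1.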

Repos have scripts/recipes to help, in particular

Will merge the base model with an mtp layer.

Main result:

4x DGX Spark / GB10, one GPU per node

GLM-5.2 NVFP4 MTP hybrid checkpoint

vLLM fork with DCP + B12X sparse MLA patches

TP4 / PP1 / DCP4 / MTP1

fp8 KV cache, explicit 1.81 GB/rank

131,072 max model len

132,096 fitted KV tokens

512 tokens/s prefill

about 14.5-15.2 output tok/s on short-prompt codegen

Can be a tiny bit inconsistent, e.g., on a 112K prompt, uncached:

(APIServer pid=736) INFO 06-29 00:12:03 [loggers.py:277] Engine 000: Avg prompt throughput: 511.6 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.2%, Prefix cache hit rate: 0.0%

(APIServer pid=736) INFO 06-29 00:12:13 [loggers.py:277] Engine 000: Avg prompt throughput: 512.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.1%, Prefix cache hit rate: 0.0%

(APIServer pid=736) INFO 06-29 00:12:23 [loggers.py:277] Engine 000: Avg prompt throughput: 511.9 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 15.0%, Prefix cache hit rate: 0.0%

(APIServer pid=736) INFO 06-29 00:12:33 [loggers.py:277] Engine 000: Avg prompt throughput: 511.9 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 18.8%, Prefix cache hit rate: 0.0%

(APIServer pid=736) INFO 06-29 00:12:43 [loggers.py:277] Engine 000: Avg prompt throughput: 512.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 22.7%, Prefix cache hit rate: 0.0%

(APIServer pid=736) INFO 06-29 00:12:53 [loggers.py:277] Engine 000: Avg prompt throughput: 409.6 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 25.8%, Prefix cache hit rate: 0.0%

(APIServer pid=736) INFO 06-29 00:13:03 [loggers.py:277] Engine 000: Avg prompt throughput: 512.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 29.7%, Prefix cache hit rate: 0.0%

(APIServer pid=736) INFO 06-29 00:13:13 [loggers.py:277] Engine 000: Avg prompt throughput: 511.9 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 33.6%, Prefix cache hit rate: 0.0%

(APIServer pid=736) INFO 06-29 00:13:23 [loggers.py:277] Engine 000: Avg prompt throughput: 512.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 37.5%, Prefix cache hit rate: 0.0%

(APIServer pid=736) INFO 06-29 00:13:33 [loggers.py:277] Engine 000: Avg prompt throughput: 409.5 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 40.6%, Prefix cache hit rate: 0.0%

(APIServer pid=736) INFO 06-29 00:13:43 [loggers.py:277] Engine 000: Avg prompt throughput: 512.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 44.5%, Prefix cache hit rate: 0.0%

Why the drop to ~409, and so regularly? Not sure. The 409/512 split is oddly consistent in its inconsistency.
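One possible reading, offered as a guess and not a diagnosis: the log lines are 10 seconds apart, so each average is just "tokens finished in that window / 10". A small sketch of that arithmetic, using only the rates in the log above and the 132,096 fitted KV tokens from the main result (the helper name is made up):

```python
LOG_INTERVAL_S = 10          # spacing of the log timestamps above
FITTED_KV_TOKENS = 132_096   # from the main result

# "Avg prompt throughput" values copied from the log lines above.
rates = [511.6, 512.0, 511.9, 511.9, 512.0, 409.6,
         512.0, 511.9, 512.0, 409.5, 512.0]


def tokens_in_window(avg_tok_per_s: float, interval_s: int = LOG_INTERVAL_S) -> int:
    """Tokens prefilled during one logging window."""
    return round(avg_tok_per_s * interval_s)


fast = tokens_in_window(512.0)
slow = tokens_in_window(409.6)
print(fast, slow, fast - slow)

# Cross-check against the KV usage jumps (3.9% vs 3.1% per window).
print(round(0.039 * FITTED_KV_TOKENS), round(0.031 * FITTED_KV_TOKENS))

# Average over the whole excerpt, slow windows included.
mean_rate = sum(rates) / len(rates)
print(f"{mean_rate:.1f} tok/s")
```

The two window sizes differ by exactly 1,024 tokens, and the KV usage deltas agree with them. That would be consistent with prefill advancing in fixed-size chunks, with a window occasionally catching one chunk fewer because the true rate sits a little under 512 tok/s. The chunk size is not stated anywhere in this post, so treat that as an assumption.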

One asterisk: I would normally consider an fp8 KV cache a bad plan for quality, but my take is that the `fp8_ds_mla` format with B12X_MLA_SPARSE is not just typical tensor-scaled fp8. This could be its own post.

The setup is not just stock vLLM. It uses a patched vLLM branch with the dark-devotion DCP work, B12X sparse MLA pieces, FlashInfer/CUTLASS MoE, and a small Spark-specific fix to disable the TP/DCP message-queue broadcaster path that was hanging in multi-node Ray startup. NCCL/RDMA remains enabled over the Spark fabric.

So to even **have a chance** to launch it, the first thing is you have to prune and I mean **prune**.

The Ray setup is intentionally tiny:

Dashboard disabled

Log monitor disabled

Usage stats disabled

Object store 128 MiB

Object spilling to /var/tmp/ray-spill

1 CPU and 1 GPU advertised per node

host networking and host IPC
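For a rough idea of what that looks like on the command line, here is a sketch that builds a `ray start` invocation from the list above. It is not the repo's launcher. The spilling JSON follows the shape in the Ray object-spilling docs, `HEAD_IP:6379` is a placeholder, and passing the system config only on the head node is my assumption. How the log monitor was disabled is not stated here, and host networking/IPC are container settings, so both are left out.

```python
import json
import shlex

OBJECT_STORE_BYTES = 128 * 1024 * 1024  # 128 MiB, as listed above
SPILL_DIR = "/var/tmp/ray-spill"        # as listed above


def ray_start_args(head: bool, head_address: str = "HEAD_IP:6379") -> list[str]:
    """Build a slim `ray start` argv (sketch; head_address is a placeholder)."""
    args = [
        "ray", "start",
        "--num-cpus=1",            # advertise 1 CPU per node
        "--num-gpus=1",            # advertise 1 GPU per node
        f"--object-store-memory={OBJECT_STORE_BYTES}",
        "--disable-usage-stats",
    ]
    if head:
        # Ray expects the spilling config as a JSON string inside the JSON.
        spill = json.dumps(
            {"type": "filesystem", "params": {"directory_path": SPILL_DIR}}
        )
        args += [
            "--head",
            "--include-dashboard=false",
            "--system-config=" + json.dumps({"object_spilling_config": spill}),
        ]
    else:
        args.append(f"--address={head_address}")
    return args


print(shlex.join(ray_start_args(head=True)))
print(shlex.join(ray_start_args(head=False)))
```

Check the flags against the Ray version in your image before trusting this; the full recipe in the repo is the reference.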

The OS also matters. I disabled irrelevant headless-node services like cups, avahi, bluetooth, ModemManager, colord, fwupd, packagekit, desktop portal/pipewire pieces, etc. **Important**: this disables the desktop GUI; only do this on headless inference nodes. On Spark unified memory, a few GB of random Linux/userland overhead can be the difference between fitting and failing.

What do you get out of this?

Some measured numbers, split the way they should be read:

Short codegen decode, MTP1: about 14.5-15.2 tok/s

Long-prompt prefill: about 450-500 input tok/s in the 16K-112K tests

Post-TTFT decode: about 13 tok/s at 32K-112K prompt sizes

Important caveat on concurrency: the 128K profile is `MAX_NUM_SEQS=1`, so concurrent requests queue. This is a single-long-context recipe, not a batch-serving recipe. A batch-oriented variant should raise `MAX_NUM_SEQS` and re-fit the KV budget, probably by lowering max context. Exercise left to the reader. But given the custom bits I certainly would not *automatically assume correctness* here.
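To put a number on "re-fit the KV budget": the ratio of fitted KV tokens to max context length is roughly how many full-length requests fit at once. A sketch of that arithmetic, using the capacities quoted in this post; the third case (DCP4 pool with a 32K cap) is hypothetical and untested, and it assumes the pool size stays the same when the context cap is lowered:

```python
def full_length_seqs(fitted_kv_tokens: int, max_model_len: int) -> float:
    """How many max-length sequences the KV pool can hold at once."""
    return fitted_kv_tokens / max_model_len


# 128K / DCP4 profile from this post: barely one full-length request.
print(round(full_length_seqs(132_096, 131_072), 2))

# 32K / DCP=1 profile from this post: same story.
print(round(full_length_seqs(33_024, 32_768), 2))

# Hypothetical batch variant: keep the DCP4 pool, cap context at 32K.
print(round(full_length_seqs(132_096, 32_768), 2))
```

The first two both come out at about 1.01, which is why `MAX_NUM_SEQS=1` is the honest setting for these profiles.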

What did **not** work:

BF16 KV at 128K: did not fit with enough headroom

DCP4/MTP3: later speculative positions collapsed in acceptance

DCP4/MTP2: sometimes competitive, but not stable enough to make default

NCCL_IB_DISABLE=1: if you lean on LLMs to help tune, they still tend to slip this in (it's getting better); just say no. You don't have InfiniBand on Spark, but the RoCE interconnect goes through the same NCCL transport, so setting this drops you back to plain sockets.

Stock container assumptions: not enough for this stack
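Since the `NCCL_IB_DISABLE=1` item above is easy to reintroduce by accident, a tiny pre-launch guard can catch it. This is a hypothetical helper, not something from the repo; `NCCL_IB_DISABLE` is the real NCCL environment variable, everything else is made up for illustration:

```python
import os


def nccl_ib_disabled(env=None) -> bool:
    """True if the environment would turn off NCCL's IB/RoCE transport."""
    env = os.environ if env is None else env
    return env.get("NCCL_IB_DISABLE", "0").strip() == "1"


def assert_rdma_allowed(env=None) -> None:
    """Refuse to launch if someone (or some LLM) set NCCL_IB_DISABLE=1."""
    if nccl_ib_disabled(env):
        raise RuntimeError("NCCL_IB_DISABLE=1 is set; unset it to keep RDMA on")


assert_rdma_allowed({"NCCL_IB_DISABLE": "0"})  # passes silently
```

Call `assert_rdma_allowed()` with no argument at the top of a launch script to check the real environment.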

If you cut context to 32K, you can run DCP=1, and I was able to get ~27 tps, so on DGX Spark there is a very real and painful tradeoff.

Part of my mission was NOT to jump straight to a REAP model here.

A key practical detail: use the hybrid checkpoint that actually contains `model.layers.78.*`. The base GLM checkpoint can advertise MTP metadata without the real MTP layer. This setup has exactly one MTP layer, so MTP1 is the clean production point. MTP2/MTP3 recursively reuse the same one-step predictor and are research territory.

Now a comment, because **something really looks buggy** here, and 30 hours into trying to figure it out I couldn't get to the bottom of it. What I see is an acceptance collapse: instead of per-position MTP acceptance looking something like

0.9, 0.75, 0.6

I see

0.9, (0.75^4), (0.6^4)

So MTP1 works fine, and whatever is going on in the code path for MTP2/MTP3 is likely being interfered with by one of the many possible spoilers: DCP=4, extreme memory tightness, sm121 quirks, whatever.
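To see what that collapse costs in throughput terms, here is a sketch of expected tokens per decode step. It assumes the numbers above are conditional per-position acceptance rates (position k only counts if every earlier position was accepted) and that each step always yields one verified token. The rates are the illustrative ones from this post, not measurements, and the function name is made up:

```python
from itertools import accumulate
from operator import mul


def expected_tokens_per_step(cond_accept: list[float]) -> float:
    """One verified token plus each draft position's chance of surviving."""
    return 1.0 + sum(accumulate(cond_accept, mul))


healthy = [0.9, 0.75, 0.6]
collapsed = [0.9, 0.75 ** 4, 0.6 ** 4]

for k in (1, 2, 3):
    h = expected_tokens_per_step(healthy[:k])
    c = expected_tokens_per_step(collapsed[:k])
    print(f"MTP{k}: healthy {h:.3f}, collapsed {c:.3f}")
```

With the collapsed rates, going from MTP1 to MTP3 moves the expected tokens per step from 1.9 to only about 2.22, against about 2.98 in the healthy case. Each extra draft position still costs compute, so under these assumptions MTP2/MTP3 would land roughly level with MTP1 at best.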

MTP2 at one point was arguably a fraction of a point better than MTP1 on some parameters.

The full guide and scripts are in the repo recipe:

The vLLM patch branch is:

I may keep tinkering with the repos/docs, but the baseline is simple:

DCP4 / 128K / MTP1

B12X sparse MLA

flashinfer_cutlass MoE

fp8_ds_mla KV

Ray slimmed down

IB/RDMA enabled

Footnote: the same 112K test prompt, sent to an 8x RTX 6000 Pro Blackwell box (4+4 behind PCIe switches), could do ~2800 tps prefill… and also decoded the 521 tokens of output at around 13 tps, which I found interesting. That same hardware, given a naked ~c=0 codegen prompt, will output about 106 tps at bs=1 and ~420 tps at bs=8 decode on shorter contexts.

Certainly a takeaway here is that the long context handling is not super impactful.

Now, if you do the same thing, and you cram the same way, and limit your context to 32K instead of 128K, you can then switch to DCP=1, MTP=3, and I get ~23 tps. My constraint notes for that run:

Model: GLM-5.2-NVFP4-MTP-hybrid
Served model: glm52-mtp3-dcp1-32k
Image: glm-darkdevotion-b12x:20260626-arm64-mtp-topkstep2
TP: 4
PP: 1
DCP: 1
MTP: 3
max_model_len: 32768
KV dtype: fp8
KV allocation: kv_cache_memory_bytes=1810000000
Attention: B12X MLA sparse
MoE: FlashInfer CUTLASS
Top-k: VLLM_MTP_RECOMPUTE_TOPK_FROM_STEP=1
Capacity: 33,024 KV tokens
Concurrency at 32K: 1.01x
Prompt: standard 512-token codegen probe