I got GLM-5.2 NVFP4 running on four DGX Sparks at 128K context (or, optionally, at 32K context, where you can run DCP=1 and get >22 tps).
Objective: a high-quality 4-bit quant running on 4x Spark. Model: Mapika/GLM-5.2-NVFP4 on Hugging Face
TL;DR: 128K context with fp8_ds_mla KV, about 14.5-15 tps decode at c0, falling to about 13 tps decode at long context (this holds up really well)
Hardware: 4x standard NVIDIA-brand GB10 DGX Sparks and a MikroTik RoCE switch.
To quote the card:
The MoE expert FFNs (routed + shared) are quantized to NVFP4; attention (MLA + the DeepSeek-style DSA lightning indexer), the router, and the LM head are kept in BF16. This shrinks the checkpoint from 1.5 TB → 410 GB (~3.7×) while retaining GSM8K accuracy within ~2 points of BF16.
Why this is interesting: the model is too large and the memory is too tight to treat Spark like normal discrete-GPU hardware. The win was combining decode-context parallelism with aggressive system/Ray memory trimming. DCP4 shards the decode context across the four TP ranks, which is what makes 128K feasible. MTP1 then recovers enough generation speed to be usable.
Repos have scripts/recipes to help, in particular
Will merge the base model with an mtp layer.
Main result:
4x DGX Spark / GB10, one GPU per node
GLM-5.2 NVFP4 MTP hybrid checkpoint
vLLM fork with DCP + B12X sparse MLA patches
TP4 / PP1 / DCP4 / MTP1
fp8 KV cache, explicit 1.81 GB/rank
131,072 max model len
132,096 fitted KV tokens
512 tokens/s prefill
about 14.5-15.2 output tok/s on short-prompt codegen
Prefill can be a little inconsistent, e.g. on an uncached 112K prompt:
(APIServer pid=736) INFO 06-29 00:12:03 [loggers.py:277] Engine 000: Avg prompt throughput: 511.6 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.2%, Prefix cache hit rate: 0.0%
(APIServer pid=736) INFO 06-29 00:12:13 [loggers.py:277] Engine 000: Avg prompt throughput: 512.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.1%, Prefix cache hit rate: 0.0%
(APIServer pid=736) INFO 06-29 00:12:23 [loggers.py:277] Engine 000: Avg prompt throughput: 511.9 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 15.0%, Prefix cache hit rate: 0.0%
(APIServer pid=736) INFO 06-29 00:12:33 [loggers.py:277] Engine 000: Avg prompt throughput: 511.9 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 18.8%, Prefix cache hit rate: 0.0%
(APIServer pid=736) INFO 06-29 00:12:43 [loggers.py:277] Engine 000: Avg prompt throughput: 512.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 22.7%, Prefix cache hit rate: 0.0%
(APIServer pid=736) INFO 06-29 00:12:53 [loggers.py:277] Engine 000: Avg prompt throughput: 409.6 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 25.8%, Prefix cache hit rate: 0.0%
(APIServer pid=736) INFO 06-29 00:13:03 [loggers.py:277] Engine 000: Avg prompt throughput: 512.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 29.7%, Prefix cache hit rate: 0.0%
(APIServer pid=736) INFO 06-29 00:13:13 [loggers.py:277] Engine 000: Avg prompt throughput: 511.9 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 33.6%, Prefix cache hit rate: 0.0%
(APIServer pid=736) INFO 06-29 00:13:23 [loggers.py:277] Engine 000: Avg prompt throughput: 512.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 37.5%, Prefix cache hit rate: 0.0%
(APIServer pid=736) INFO 06-29 00:13:33 [loggers.py:277] Engine 000: Avg prompt throughput: 409.5 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 40.6%, Prefix cache hit rate: 0.0%
(APIServer pid=736) INFO 06-29 00:13:43 [loggers.py:277] Engine 000: Avg prompt throughput: 512.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 44.5%, Prefix cache hit rate: 0.0%
Why the drop to 409, and why so regularly? Not sure. The 409/512 alternation is oddly consistent in its inconsistency.
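One way to look at the 409/512 pattern is to turn each logged rate back into tokens per logging window. This is a minimal sketch, assuming the 10-second interval visible in the timestamps; the regex and the `tokens_per_window` helper are mine for illustration, not part of vLLM, and the two log lines are trimmed copies of the ones above.

```python
import re

# Two of the log lines above, trimmed to the relevant field.
LOG_LINES = [
    "INFO 06-29 00:12:43 [loggers.py:277] Engine 000: Avg prompt throughput: 512.0 tokens/s",
    "INFO 06-29 00:12:53 [loggers.py:277] Engine 000: Avg prompt throughput: 409.6 tokens/s",
]

WINDOW_SECONDS = 10  # assumption: read off the timestamps, which are 10 s apart

RATE_RE = re.compile(r"Avg prompt throughput: ([0-9.]+) tokens/s")


def tokens_per_window(line: str, window: float = WINDOW_SECONDS) -> float:
    """Convert a logged average prompt rate into tokens processed in one window."""
    match = RATE_RE.search(line)
    if match is None:
        raise ValueError("no prompt throughput field in line")
    return float(match.group(1)) * window


for line in LOG_LINES:
    tokens = tokens_per_window(line)
    print(f"{tokens:.0f} tokens in window, {tokens / 1024:.2f} x 1024")
```

The two rates work out to 5,120 and 4,096 tokens per window, both whole multiples of 1,024. That would be consistent with prefill advancing in fixed-size chunks and the logging window occasionally catching one fewer chunk, rather than a real slowdown. Treat that as a guess: it has not been checked against the scheduler settings of this run.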
A small asterisk here: I would normally consider an fp8 KV cache a bad plan for quality, but my take is that the `fp8_ds_mla` format with B12X_MLA_SPARSE is not just typical tensor-scaled fp8. This could be its own post.
The setup is not just stock vLLM. It uses a patched vLLM branch with the dark-devotion DCP work, B12X sparse MLA pieces, FlashInfer/CUTLASS MoE, and a small Spark-specific fix to disable the TP/DCP message-queue broadcaster path that was hanging in multi-node Ray startup. NCCL/RDMA remains enabled over the Spark fabric.
So to even **have a chance** of launching it, the first thing you have to do is prune, and I mean **prune**.
The Ray setup is intentionally tiny:
Dashboard disabled
Log monitor disabled
Usage stats disabled
Object store 128 MiB
Object spilling to /var/tmp/ray-spill
1 CPU and 1 GPU advertised per node
host networking and host IPC
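As a rough illustration of the list above, here is a sketch that builds `ray start` command lines with the same limits. It is not the recipe from the repo. The flags used are standard `ray start` options, but `HEAD_IP:6379` is a placeholder, putting the system config and usage-stats flag on the head node only is my assumption, and the log monitor and container settings (host networking, host IPC) are not covered here.

```python
import json
import shlex

SPILL_DIR = "/var/tmp/ray-spill"
OBJECT_STORE_BYTES = 128 * 1024 * 1024  # 128 MiB


def ray_start_args(head: bool, head_address: str = "HEAD_IP:6379") -> list[str]:
    """Build a slim `ray start` command line for one node (illustrative only)."""
    spilling = {"type": "filesystem", "params": {"directory_path": SPILL_DIR}}
    # Ray takes object_spilling_config as a JSON string nested inside the system config.
    system_config = {"object_spilling_config": json.dumps(spilling)}
    args = ["ray", "start"]
    if head:
        args += [
            "--head",
            "--port=6379",
            "--include-dashboard=false",
            "--disable-usage-stats",
            f"--system-config={json.dumps(system_config)}",
        ]
    else:
        # Placeholder address: replace with the real head node IP and port.
        args += [f"--address={head_address}"]
    args += [
        "--num-cpus=1",
        "--num-gpus=1",
        f"--object-store-memory={OBJECT_STORE_BYTES}",
    ]
    return args


print(shlex.join(ray_start_args(head=True)))
print(shlex.join(ray_start_args(head=False)))
```

Printing the commands rather than running them keeps this safe to try on a workstation before touching the nodes.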
The OS also matters. I disabled irrelevant headless-node services like cups, avahi, bluetooth, ModemManager, colord, fwupd, packagekit, desktop portal/pipewire pieces, etc. **Important**: this disables the desktop GUI; only do this on headless inference nodes. On Spark unified memory, a few GB of random Linux/userland overhead can be the difference between fitting and failing.
What do you get out of this?
Some measured numbers, split the way they should be read:
Short codegen decode, MTP1: about 14.5-15.2 tok/s
Long-prompt prefill: about 450-500 input tok/s in the 16K-112K tests
Post-TTFT decode: about 13 tok/s at 32K-112K prompt sizes
Important caveat on concurrency: the 128K profile is `MAX_NUM_SEQS=1`, so concurrent requests queue. This is a single-long-context recipe, not a batch-serving recipe. A batch-oriented variant should raise `MAX_NUM_SEQS` and re-fit the KV budget, probably by lowering max context. Exercise left to the reader. But given the custom bits I certainly would not *automatically assume correctness* here.
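To make the re-fit idea concrete, here is a back-of-envelope sketch. It assumes the fitted KV capacity stays at the 132,096 tokens reported for the DCP4 profile and only `max_model_len` changes; a real re-fit would also have to account for activations, MTP buffers, and block rounding, so read the result as an upper bound. The helper name is hypothetical.

```python
KV_TOKENS_128K_PROFILE = 132_096  # fitted KV tokens reported for the DCP4 profile


def full_length_seqs(kv_tokens: int, max_model_len: int) -> int:
    """How many requests could each fill max_model_len before KV runs out."""
    if max_model_len <= 0:
        raise ValueError("max_model_len must be positive")
    return kv_tokens // max_model_len


for max_len in (131_072, 65_536, 32_768):
    n = full_length_seqs(KV_TOKENS_128K_PROFILE, max_len)
    print(f"max_model_len={max_len}: up to {n} full-length sequences")
```

Under those assumptions the same KV budget holds one sequence at 128K, two at 64K, or four at 32K, which is the ceiling any `MAX_NUM_SEQS` choice would have to respect.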
What did **not** work:
BF16 KV at 128K: did not fit with enough headroom
DCP4/MTP3: later speculative positions collapsed in acceptance
DCP4/MTP2: sometimes competitive, but not stable enough to make default
NCCL_IB_DISABLE=1: if you lean on LLMs to help tune, they still tend to drop this in (they are getting better); just say no. You don't have InfiniBand on Spark, but NCCL's IB transport is also what carries RoCE, so the interconnect works as long as you leave it enabled.
Stock container assumptions: not enough for this stack
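Since the NCCL_IB_DISABLE item above is easy to reintroduce by accident, a small preflight check can catch it before launch. This is a sketch with a hypothetical helper name, not something from the repo scripts; it only inspects the environment it is given.

```python
import os


def nccl_env_problems(env: dict[str, str]) -> list[str]:
    """Return warnings for NCCL settings that would disable the RDMA path."""
    problems = []
    # NCCL_IB_DISABLE=1 turns off NCCL's IB/RoCE transport and falls back to sockets.
    if env.get("NCCL_IB_DISABLE", "0").strip() == "1":
        problems.append("NCCL_IB_DISABLE=1 is set; unset it to keep RoCE/RDMA enabled")
    return problems


if __name__ == "__main__":
    for problem in nccl_env_problems(dict(os.environ)):
        print("WARNING:", problem)
```

Running it inside the container, with the same environment the workers see, is the useful place to check.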
If you cut context to 32K, you can run DCP=1, and then I was able to get ~23 tps, so on DGX Spark there is a very real and painful tradeoff.
Part of my mission was NOT to jump straight to a REAP model here.
A key practical detail: use the hybrid checkpoint that actually contains `model.layers.78.*`. The base GLM checkpoint can advertise MTP metadata without the real MTP layer. This setup has exactly one MTP layer, so MTP1 is the clean production point. MTP2/MTP3 recursively reuse the same one-step predictor and are research territory.
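A quick way to check a checkpoint for the real MTP layer is to look at its tensor names. The sketch below assumes the checkpoint ships the usual Hugging Face sharded index file, `model.safetensors.index.json`, with a `weight_map` of tensor name to shard; the function name and the path in the usage note are placeholders.

```python
import json
from pathlib import Path


def has_layer(index_path: str, layer: int = 78) -> bool:
    """Return True if the safetensors index lists any tensor under model.layers.<layer>."""
    index = json.loads(Path(index_path).read_text())
    prefix = f"model.layers.{layer}."
    return any(name.startswith(prefix) for name in index["weight_map"])
```

Usage would look like `has_layer("/path/to/checkpoint/model.safetensors.index.json")`. A `False` result on a checkpoint whose config still advertises MTP is the mismatch described above.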
Now a comment here, because **something really looks buggy**, and 30 hours into trying to figure it out I couldn’t get to the bottom of it. What I see is an acceptance collapse: instead of per-position MTP acceptance looking something like
0.9, 0.75, 0.6
I see
0.9, (0.75^4), (0.6^4)
So MTP1 works fine, and whatever is going on in the code for MTP2/MTP3 is likely being interfered with by one of the many possible spoilers: DCP=4, extreme memory tightness, sm121 quirks, whatever.
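To see what that collapse costs, here is a small sketch of expected tokens per verify step. It assumes the rates above are conditional per-position acceptance rates (position k counts only if every earlier position was accepted) and ignores the cost of the extra draft passes. The numbers are the illustrative ones from above, not measurements.

```python
def expected_tokens_per_step(acceptance: list[float]) -> float:
    """Expected tokens emitted per verify step.

    One token always comes from the target model; draft position k only
    counts if positions 1..k were all accepted.
    """
    expected = 1.0
    survive = 1.0
    for rate in acceptance:
        survive *= rate
        expected += survive
    return expected


healthy = [0.9, 0.75, 0.6]
observed = [0.9, 0.75 ** 4, 0.6 ** 4]

print(f"MTP1 only:     {expected_tokens_per_step(healthy[:1]):.2f}")
print(f"MTP3 expected: {expected_tokens_per_step(healthy):.2f}")
print(f"MTP3 observed: {expected_tokens_per_step(observed):.2f}")
```

With these inputs the healthy MTP3 case yields about 3.0 tokens per step, the collapsed case about 2.2, and MTP1 alone about 1.9. Under that model, the collapsed tail buys very little over MTP1 once the extra draft work is paid for.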
MTP2 at one point was arguably a fraction of a point better than MTP1 on some parameters.
The full guide and scripts are in the repo recipe:
The vLLM patch branch is:
I may keep tinkering with the repos/docs, but the baseline is simple:
DCP4 / 128K / MTP1
B12X sparse MLA
flashinfer_cutlass MoE
fp8_ds_mla KV
Ray slimmed down
IB/RDMA enabled
Footnote: the same 112K prompt I was using to test, sent to an 8x RTX 6000 Pro Blackwell box (with 4+4 behind PCIe switches), could do ~2800 tps prefill… and also decoded the 521 tokens of output at around 13 tps, which I found interesting. That same hardware, given a naked ~c=0 codegen prompt, will output about 106 tps at bs=1 and ~420 tps at bs=8 decode on shorter contexts.
Certainly a takeaway here is that the long-context handling is not super impactful.
Now, if you do the same thing, cram the same way, and limit your context to 32K instead of 128K, you can switch to DCP=1, MTP=3, and I get ~23 tps. My constraint notes for that run:
Model: GLM-5.2-NVFP4-MTP-hybrid
Served model: glm52-mtp3-dcp1-32k
Image: glm-darkdevotion-b12x:20260626-arm64-mtp-topkstep2
TP: 4
PP: 1
DCP: 1
MTP: 3
max_model_len: 32768
KV dtype: fp8
KV allocation: kv_cache_memory_bytes=1810000000
Attention: B12X MLA sparse
MoE: FlashInfer CUTLASS
Top-k: VLLM_MTP_RECOMPUTE_TOPK_FROM_STEP=1
Capacity: 33,024 KV tokens
Concurrency at 32K: 1.01x
Prompt: standard 512-token codegen probe
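The capacity figures in these notes line up with the 128K profile, which is easy to verify with a few divisions. The sketch below only uses numbers already quoted in this post; the per-token figure is a plain division of the KV allocation by the token capacity, so it lumps in whatever else lives in that allocation.

```python
KV_BYTES_PER_RANK = 1_810_000_000   # kv_cache_memory_bytes from the notes
TOKENS_DCP1 = 33_024                # capacity reported for the 32K / DCP1 profile
TOKENS_DCP4 = 132_096               # fitted KV tokens reported for the 128K / DCP4 profile

print(f"DCP4 / DCP1 capacity: {TOKENS_DCP4 / TOKENS_DCP1:.2f}x")
print(f"32K concurrency:      {TOKENS_DCP1 / 32_768:.2f}x")
print(f"128K concurrency:     {TOKENS_DCP4 / 131_072:.2f}x")
print(f"approx bytes per cached token per rank: {KV_BYTES_PER_RANK / TOKENS_DCP1:,.0f}")
```

The same 1.81 GB per rank holds 33,024 tokens at DCP=1 and exactly four times that at DCP=4, so both profiles sit at roughly 1.01x of their max context. That is the tradeoff in one line: DCP4 buys 4x the context from the same KV memory, at the decode speed cost described above.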