Validating Your Formula: 4-Node TP=4 Benchmark Data + The η Coefficient
Great analysis @flash3. I ran a systematic benchmark to validate your tok/s = β / (W + KV) formula on my 4x DGX Spark cluster running GLM-4.7-FP8 (355B MoE, ~32B active).
Setup
- 4x DGX Spark, TP=4, 200Gbps RoCE (MikroTik CRS812 DDQ)
- GLM-4.7-FP8, EAGLE speculative decoding, custom MoE kernel configs
- SGLang v0.5.4.post2 with `--tool-call-parser glm47`
The Missing Coefficient: η ≈ 0.22
Your formula predicts the shape of degradation perfectly, but real-world throughput is ~22% of theoretical maximum. The corrected formula:
tok/s = η × (β × TP) / (W + KV)
Where η ≈ 0.22 captures NCCL/RDMA overhead, framework pipeline bubbles, and attention compute cost.
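To make the model concrete, here is a minimal sketch of the corrected formula as code. Note the constants `BETA` and `W` are not stated in this thread: ~273 GB/s is the commonly cited DGX Spark memory bandwidth, and ~32 GB approximates the active FP8 weights read per token for a ~32B-active MoE. They are illustrative assumptions, not measurements.

```python
# Minimal sketch of the corrected throughput model.
# BETA and W are assumptions (see lead-in), ETA is the measured coefficient.
ETA = 0.22    # measured efficiency coefficient
BETA = 273.0  # GB/s memory bandwidth per DGX Spark (assumed)
TP = 4        # tensor-parallel degree
W = 32.0      # GB of active weights read per token (assumed)

def raw_tok_s(kv_gb: float) -> float:
    """Theoretical bound: (BETA * TP) / (W + KV)."""
    return (BETA * TP) / (W + kv_gb)

def corrected_tok_s(kv_gb: float, eta: float = ETA) -> float:
    """Corrected model: eta * (BETA * TP) / (W + KV)."""
    return eta * raw_tok_s(kv_gb)
```

With these assumed constants the sketch tracks the table below: `corrected_tok_s(0.07)` lands near 7.5 tok/s and `corrected_tok_s(4.76)` near 6.5.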
Here’s the data:
| Context | KV (GB) | Measured tok/s | Predicted tok/s | η |
|---|---|---|---|---|
| 512 | 0.07 | 7.50 | 34.05 | 0.22 |
| 1,024 | 0.15 | 7.58 | 33.97 | 0.22 |
| 2,048 | 0.30 | 7.48 | 33.81 | 0.22 |
| 4,096 | 0.60 | 7.41 | 33.50 | 0.22 |
| 8,192 | 1.19 | 7.24 | 32.90 | 0.22 |
| 16,384 | 2.38 | 6.99 | 31.76 | 0.22 |
| 32,768 | 4.76 | 6.54 | 29.70 | 0.22 |
η is constant across all context lengths. This means the overhead is context-independent — it’s a fixed efficiency loss from the distributed pipeline, not something that gets worse with longer context.
Your “17K = 1 tok/s” Claim
You’re right for a single DGX Spark. But with TP=4:
- Your prediction (1 node, GLM-4.7-Flash): ~1 tok/s at 17K context
- My measurement (4 nodes, GLM-4.7-FP8): 6.99 tok/s at 16K context
TP=4 effectively multiplies the aggregate bandwidth in the numerator (β × TP), pushing the "agentic death zone" much further out. At 32K context I'm still getting 6.54 tok/s, only 13% degradation from the 512-token baseline.
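The 13% figure is just the ratio of two measured rows from the table:

```python
# Degradation of measured throughput at 32K vs the 512-token baseline.
baseline = 7.50  # tok/s at 512 context
at_32k = 6.54    # tok/s at 32,768 context

degradation = (baseline - at_32k) / baseline
print(f"{degradation:.0%}")  # -> 13%
```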
Agentic Coding: Usable, With Caveats
I ran a 10-turn tool calling simulation (read files, write code, run commands):
| Turns | Result |
|---|---|
| 1-7 | Stable 7-9 tok/s, all tool calls correct |
| 8-10 | Tool call format broke at ~3K accumulated tokens |
The failure at turn 8 is not bandwidth; the context sweep above shows 32K works fine. It's a conversation template / tool message accumulation issue in the serving framework, fixable in software.
So: DGX Spark can do agentic coding on a 4-node cluster, but you need context window management after ~7 tool-heavy turns.
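One mitigation is plain client-side history trimming. This is a hypothetical sketch (`trim_history` and `MAX_TOOL_TURNS` are my names, not anything in SGLang): keep the system prompt and only the most recent assistant turns, so accumulated tool-result messages are dropped before the ~7-turn breaking point observed above. Messages follow the common OpenAI-style role/content dicts.

```python
# Hypothetical sketch: keep the system prompt plus the last N assistant
# turns, dropping older messages that accumulate and break the template.
MAX_TOOL_TURNS = 6  # trim before the ~7-turn breaking point

def trim_history(messages: list[dict]) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Walk backwards, counting assistant turns; keep only the newest ones.
    kept, turns = [], 0
    for m in reversed(rest):
        if m["role"] == "assistant":
            turns += 1
        if turns > MAX_TOOL_TURNS:
            break
        kept.append(m)
    return system + list(reversed(kept))
```

A smarter variant would summarize dropped tool output instead of discarding it, but even this blunt version keeps the template within the range that worked in turns 1-7.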
EAGLE: A Measurement Trap
One finding worth sharing: streaming client measurement underreports EAGLE performance by ~2.2x. EAGLE generates multiple tokens per speculative step, but the streaming API chunks them individually. My streaming client showed EAGLE ON as slower than EAGLE OFF — completely wrong.
Server-side reality:
| | EAGLE OFF | EAGLE ON |
|---|---|---|
| Server throughput | ~14 tok/s | 16.77 tok/s |
| Real improvement | — | +20% |
Anyone benchmarking EAGLE on DGX Spark: use server-side metrics, not streaming client timing.
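For anyone reproducing this, one way to sidestep the trap entirely is to time a non-streaming request and derive tok/s from the server-reported `usage` field. This is a sketch, not my actual benchmark script; the host, port, and model name are placeholders for your deployment (SGLang serves an OpenAI-compatible API):

```python
# Sketch: measure tok/s from server-reported token counts, not stream chunks.
import json
import time
import urllib.request

def tok_s_from_usage(resp: dict, elapsed_s: float) -> float:
    """Server-reported completion tokens over wall time."""
    return resp["usage"]["completion_tokens"] / elapsed_s

def bench_once(host: str, model: str, prompt: str) -> float:
    # host/model are placeholders, e.g. "http://localhost:30000".
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "stream": False,  # no per-chunk timing distortion at all
    }).encode()
    req = urllib.request.Request(
        host + "/v1/chat/completions", body,
        {"Content-Type": "application/json"})
    t0 = time.perf_counter()
    with urllib.request.urlopen(req) as r:
        resp = json.load(r)
    return tok_s_from_usage(resp, time.perf_counter() - t0)
```

This still includes prefill time in the denominator, so for long prompts subtract time-to-first-token or use the server's own metrics, but it can never miscount what EAGLE generated.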
Connecting to Hot Expert Replication
Your Hot Expert Replication thesis is the logical next step. My current TP=4 setup shards everything uniformly. If hot experts (the ~20% handling ~80% of activations) were replicated locally, the inter-node traffic would drop dramatically — your math shows 50% remote → 10% remote.
This would improve η. Right now η=0.22 includes significant NCCL all-reduce overhead on every MoE layer. If 80% of expert activations resolved locally, that overhead drops for most tokens, potentially pushing η toward 0.30+. That would mean ~9-10 tok/s at 32K context instead of 6.5.
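The ~9-10 tok/s projection follows from the model's linear dependence on η, i.e. the measured 32K figure scales by η_new / η_old:

```python
# Back-of-envelope projection: tok/s scales linearly with eta in the model.
eta_now, eta_proj = 0.22, 0.30
tok_s_32k = 6.54  # measured at 32,768 context

projected = tok_s_32k * (eta_proj / eta_now)
print(f"{projected:.1f} tok/s")  # -> 8.9 tok/s
```

η = 0.33 would put it right at 10 tok/s, which is why the projection is stated as a range.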
Bottom Line
Your formula is correct in structure. The practical version for multi-node DGX Spark:
tok/s = η × (β × TP) / (W + KV)
- η ≈ 0.22 (current, uniform TP sharding)
- η ≈ 0.30+ (projected, with hot expert replication)
The DGX Spark isn’t a chat-only device — it’s a bandwidth-constrained device that rewards smart engineering. TP scaling, MoE kernel tuning, and EAGLE already make agentic coding viable. Hot expert replication could make it genuinely competitive.
Full benchmark scripts and data: GitHub - BTankut/dgx-spark-sglang-moe-configs