Why 273 GB/s? Less Is More, Until It Isn’t

I’d like to share some suspicious findings.

Is this a defect in the matrix?
Stupid routing?
Or a big fail in my histogram code?


With 64 experts, 4 selected (order does not matter):

C(64, 4) = 635,376 possible combinations.

In practice, 356,529 distinct tuples were observed — about 56% of all possible combinations.
However, the distribution is extremely uneven:

| Tier | Tuples | Share of observed tuples | Share of traffic |
|---|---|---|---|
| Tier 1 (≥ 90%) | 46 | 0.01% | ~67% |
| Tier 2 (≥ 20%) | 92 | 0.03% | ~95% |
| Rest | 356,437 | 99.97% | ~5% |

92 out of 635,376 possible combinations account for 95% of the traffic.
The model effectively uses the same fixed expert groups most of the time (…in my test scenario, of course).
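A minimal sketch of the kind of tuple histogram behind these counts (the function name and its input format are hypothetical, not from the actual test harness; the router's top-k IDs are sorted so order is discarded):

```python
from collections import Counter
from math import comb

# 64 experts, choose 4 per token, order irrelevant
assert comb(64, 4) == 635_376

def tuple_histogram(topk_ids_per_token):
    """Count how often each unordered expert combination fires.

    topk_ids_per_token: iterable of 4-element expert-ID lists per token,
    e.g. dumped from the router. For per-layer counting, key by
    (layer, tuple) instead, since each layer holds different FFNs.
    """
    hist = Counter()
    for ids in topk_ids_per_token:
        hist[tuple(sorted(ids))] += 1  # sort so (8, 23, ...) == (23, 8, ...)
    return hist

# toy example: two tokens hit the same combo, one differs
hist = tuple_histogram([[3, 1, 2, 0], [0, 1, 2, 3], [5, 1, 2, 0]])
assert hist.most_common(1)[0] == ((0, 1, 2, 3), 2)
```

Sorting the IDs before counting is what collapses the 4-permutations into one unordered tuple.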

[figure: expert fire count]

[figure: layer-wise expert trajectory]

When tracing the paths taken through the model, a very strong divergence emerges across all FFNs (experts per layer) — each path appears to be unique. Even partial paths already show strong divergence in the early layers, and it doesn’t get better from there. Speculating on repeating FFNs across a full token pass does not lead to any viable optimization approach. What remains is the search for hot experts and their replication or redundant pre-loading on additional nodes.

In the initial tests, experts were counted by ID only, which is of course incorrect: the experts in each layer are different FFNs and therefore must be counted by both ID and layer. Nevertheless, a clustering pattern does emerge, which points to an optimization approach for distributing FFNs across all DGX nodes in a cluster.

These results also underline that a complete copy of the model can only run entirely on a single node if it fits completely into VRAM. When distributing the model across DGX nodes, one could already co-distribute the occasional hot FFN redundantly, provided space is available. …but all-to-all communication remains mandatory.

Co-Activation Aware Placement — Extending EPLB’s Per-Layer Tracking

…been digging into vLLM’s EPLB. Really like that it already tracks load per expert per layer — more granular than I expected. Hot expert replication and dynamic rebalancing are solid. Basically hot/cold handling from a different angle. (use num_redundant_experts)

But it optimizes for individual expert load, not for which experts fire together. EPLB answers “which experts are busy?” — not “which experts should be co-located?”

If experts 8, 23, and 32 consistently activate on the same token, placing them on the same node eliminates cross-node traffic for those calls. Load could be perfectly balanced but locality still suboptimal.

Playing around with tuple tracking on GLM-4-Flash. Data shows clear hot combos — feels like co-activation patterns could inform placement on top of EPLB.
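On top of EPLB's per-expert load tracking, co-activation aware placement would need a pairwise statistic. A minimal sketch of what that could look like (all names are hypothetical; vLLM's EPLB does not expose this today):

```python
import numpy as np

NUM_EXPERTS = 64

def coactivation_matrix(topk_ids_per_token, num_experts=NUM_EXPERTS):
    """C[i, j] = number of tokens on which experts i and j fired together.

    The diagonal is the plain per-expert load that EPLB already tracks;
    the off-diagonal entries are the extra signal for co-location.
    """
    C = np.zeros((num_experts, num_experts), dtype=np.int64)
    for ids in topk_ids_per_token:
        for i in ids:
            for j in ids:
                C[i, j] += 1
    return C

# toy example: experts 8, 23, 32 co-fire on both tokens
C = coactivation_matrix([[8, 23, 32, 40], [8, 23, 32, 41]])
assert C[8, 23] == 2 and C[23, 32] == 2
assert C[8, 8] == 2   # diagonal = per-expert load
assert C[40, 41] == 0  # never co-fired
```

A placement pass could then greedily group expert pairs with the largest off-diagonal counts onto the same node, subject to load balance.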

But here’s the thing — the network cost might be negligible (as baristankut pointed out earlier in this thread):

What goes over the wire isn’t the weights, it’s the hidden states. For GLM-4-Flash:

  • Weights per expert: ~20 MB (stay local, loaded from RAM)
  • Hidden state per token: ~16 KB (goes over network)
  • Ratio: 1250:1

The bottleneck is memory bandwidth (273 GB/s) for loading weights, not the network. Two nodes = double the memory bandwidth = ~1.7× speedup per layer. And, according to the histograms, even for a large model split across two nodes no path repeats (a kind of entropic ordering chaos). This makes EP win because weights are huge and hidden states are tiny.
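A quick back-of-envelope check of those numbers (byte counts taken from the bullets above, which are themselves approximations):

```python
# assumed values from the post (approximate, GLM-4-Flash)
expert_weights_bytes = 20e6   # ~20 MB per expert: stays local, read from RAM
hidden_state_bytes = 16e3     # ~16 KB per token: crosses the network

ratio = expert_weights_bytes / hidden_state_bytes
assert ratio == 1250.0  # weights dominate traffic by three orders of magnitude
```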

The latency sweet spot - my context problem:

Context behaves like weights in terms of memory bandwidth: while the active weights stay the same size, the context grows (at least in my use cases). Honestly, this was and still is the pain point for me. Interestingly, the gains scale with context length:

| Context | 1 Node | 2 Nodes | Gain |
|---|---|---|---|
| 1K | 37 tok/s | 41 tok/s | +11% |
| 8K | 20 tok/s | 33 tok/s | +65% |
| 17K | 10 tok/s | 18 tok/s | +80% |
| 32K | 5 tok/s | 10 tok/s | +100% |

At long context, KV-cache saturates memory bandwidth entirely. A second node doubles aggregate bandwidth for everything — KV-cache, expert weights, shared layers.
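This can be sketched with a toy model: per-token time = memory traffic / bandwidth plus a fixed overhead that extra bandwidth cannot remove. All numbers below are illustrative assumptions, not measurements:

```python
def tok_s(beta_gb_s, weights_gb, kv_gb, overhead_s=0.005):
    """Seconds per token = memory traffic / bandwidth + a fixed overhead
    (kernel launches, network hops) that bandwidth scaling cannot remove."""
    return 1.0 / ((weights_gb + kv_gb) / beta_gb_s + overhead_s)

# the 2-node gain grows with context because KV comes to dominate the traffic
gain_short = tok_s(2 * 273, 5, 1) / tok_s(273, 5, 1)    # ~1.7x at small KV
gain_long = tok_s(2 * 273, 5, 40) / tok_s(273, 5, 40)   # ~1.9x at large KV
assert gain_long > gain_short
```

With zero overhead the gain would be exactly 2× everywhere; the fixed term is what eats the benefit at short context.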

So maybe co-activation placement is premature optimization? The real win from EP is memory bandwidth scaling, not network locality. Unless you’re on slow interconnects, all-to-all dispatch cost is negligible. More is less.

Still curious if anyone sees different numbers at scale. For now I have to conclude that chaos scales best.

So finally, from a product manager’s perspective, DGX Spark was built to be bought twice — at least.


That was my conclusion as well, because of the economics and how it scales for training. And that's been the case according to Micro Center salespeople.

If I hadn’t read somewhere in the marketing that this is “the AI revolution for pros” and that you can “deploy straight from Spark to data center”… I would’ve assumed it’s a consumer product they spiced up with leftovers from the datacenter business.

But I don’t want to trash it too hard. Maybe someone could ask if the next Spark — uh — “DGX Fire” could have a bit more of… everything? Or whether the DGX Spark really is the absolute baseline, the floor, the minimum, the thing that technically doesn’t even work unless you seriously mod it?

In 100 years they'll say: "Back in 2025, that's when the foundation was laid. But it was so deep underground they had to lay another one on top just to have something to build on."

"Built to be bought twice" is probably true for long-context agent workflows, though not necessarily as product intent: your own numbers show the 1-to-2-node benefit grows with context and reaches approx. 2× at 32K, which fits the "KV-cache saturates bandwidth" model.

Beyond inference, the real win is iteration speed for research: local fine-tuning work without queue time. For students/researchers/developers who can't reliably get cluster access, that's often the difference between "can test ideas weekly" and "can't test at all."

And the Spark scales really nicely for training.


Exactly. It’s made for training. Afterwards you’re either disappointed or a deep tech AI expert. According to other posts in this forum, maybe both.

Validating Your Formula: 4-Node TP=4 Benchmark Data + The η Coefficient

Great analysis @flash3. I ran a systematic benchmark to validate your tok/s = β / (W + KV) formula on my 4x DGX Spark cluster running GLM-4.7-FP8 (355B MoE, ~32B active).

Setup

  • 4x DGX Spark, TP=4, 200Gbps RoCE (MikroTik CRS812 DDQ)
  • GLM-4.7-FP8, EAGLE speculative decoding, custom MoE kernel configs
  • SGLang v0.5.4.post2 with --tool-call-parser glm47

The Missing Coefficient: η ≈ 0.22

Your formula predicts the shape of degradation perfectly, but real-world throughput is ~22% of theoretical maximum. The corrected formula:

tok/s = η × (β × TP) / (W + KV)

Where η ≈ 0.22 captures NCCL/RDMA overhead, framework pipeline bubbles, and attention compute cost.

Here’s the data:

| Context | KV (GB) | Measured tok/s | Predicted tok/s | η |
|---|---|---|---|---|
| 512 | 0.07 | 7.50 | 34.05 | 0.22 |
| 1,024 | 0.15 | 7.58 | 33.97 | 0.22 |
| 2,048 | 0.30 | 7.48 | 33.81 | 0.22 |
| 4,096 | 0.60 | 7.41 | 33.50 | 0.22 |
| 8,192 | 1.19 | 7.24 | 32.90 | 0.22 |
| 16,384 | 2.38 | 6.99 | 31.76 | 0.22 |
| 32,768 | 4.76 | 6.54 | 29.70 | 0.22 |

η is constant across all context lengths. This means the overhead is context-independent — it’s a fixed efficiency loss from the distributed pipeline, not something that gets worse with longer context.
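For what it's worth, the predicted column reproduces from β = 273 GB/s per node, TP = 4, and W ≈ 32 GB of active FP8 weights; W is inferred here from the numbers themselves, not stated explicitly in the benchmark:

```python
BETA_GB_S = 273.0   # per-node memory bandwidth
TP = 4              # tensor parallel degree (4 nodes)
W_GB = 32.0         # active FP8 weights per token (inferred, ~32B active params)
ETA = 0.22          # measured efficiency coefficient

def predicted_tok_s(kv_gb):
    """Ideal bandwidth-bound decode rate with TP aggregating bandwidth."""
    return (BETA_GB_S * TP) / (W_GB + kv_gb)

# spot-check two table rows
assert abs(predicted_tok_s(0.07) - 34.05) < 0.05   # 512-token context
assert abs(predicted_tok_s(4.76) - 29.70) < 0.05   # 32K context

# applying eta recovers the measured throughput
assert abs(ETA * predicted_tok_s(0.07) - 7.50) < 0.1
```

That W ≈ 32 GB matches ~32B active parameters at one byte each, which is a nice consistency check on the FP8 setup.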

Your “17K = 1 tok/s” Claim

You’re right for a single DGX Spark. But with TP=4:

  • Your prediction (1 node, GLM-4.7-Flash): ~1 tok/s at 17K context
  • My measurement (4 nodes, GLM-4.7-FP8): 6.99 tok/s at 16K context

TP=4 effectively multiplies the bandwidth denominator, pushing the “agentic death zone” much further out. At 32K context I’m still getting 6.54 tok/s — only 13% degradation from baseline.

Agentic Coding: Usable, With Caveats

I ran a 10-turn tool calling simulation (read files, write code, run commands):

| Turns | Result |
|---|---|
| 1-7 | Stable 7-9 tok/s, all tool calls correct |
| 8-10 | Tool call format broke at ~3K accumulated tokens |

The failure at turn 8 is not bandwidth — Test 1 proves 32K context works fine. It’s a conversation template / tool message accumulation issue in the serving framework. Fixable in software.

So: DGX Spark can do agentic coding on a 4-node cluster, but you need context window management after ~7 tool-heavy turns.

EAGLE: A Measurement Trap

One finding worth sharing: streaming client measurement underreports EAGLE performance by ~2.2x. EAGLE generates multiple tokens per speculative step, but the streaming API chunks them individually. My streaming client showed EAGLE ON as slower than EAGLE OFF — completely wrong.

Server-side reality:

| | EAGLE OFF | EAGLE ON |
|---|---|---|
| Server throughput | ~14 tok/s | 16.77 tok/s |
| Real improvement | | +20% |

Anyone benchmarking EAGLE on DGX Spark: use server-side metrics, not streaming client timing.

Connecting to Hot Expert Replication

Your Hot Expert Replication thesis is the logical next step. My current TP=4 setup shards everything uniformly. If hot experts (the ~20% handling ~80% of activations) were replicated locally, the inter-node traffic would drop dramatically — your math shows 50% remote → 10% remote.

This would improve η. Right now η=0.22 includes significant NCCL all-reduce overhead on every MoE layer. If 80% of expert activations resolved locally, that overhead drops for most tokens, potentially pushing η toward 0.30+. That would mean ~9-10 tok/s at 32K context instead of 6.5.
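In plain arithmetic (η = 0.30 is the projection, not a measurement):

```python
# ideal bandwidth-bound tok/s at 32K context, from the benchmark table
predicted_32k = 29.70

current = 0.22 * predicted_32k     # uniform TP sharding
projected = 0.30 * predicted_32k   # projected: hot expert replication improves eta

assert round(current, 1) == 6.5
assert round(projected, 1) == 8.9  # roughly the ~9 tok/s claimed, at the low end
```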

Bottom Line

Your formula is correct in structure. The practical version for multi-node DGX Spark:

tok/s = η × (β × TP) / (W + KV)

η ≈ 0.22 (current, uniform TP sharding)
η ≈ 0.30+ (projected, with hot expert replication)

The DGX Spark isn’t a chat-only device — it’s a bandwidth-constrained device that rewards smart engineering. TP scaling, MoE kernel tuning, and EAGLE already make agentic coding viable. Hot expert replication could make it genuinely competitive.

Full benchmark scripts and data: GitHub - BTankut/dgx-spark-sglang-moe-configs

Wonderful. Thanks for the correction factors. But you see, this is where a DGX Spark owner starts. And I've learned a lot while investigating the FFNs, and although there's plenty of chaos there, it helps to scale in combination with your (totally right) engineering approach. So it's not made for pros, it's made for people who like to be challenged.

And thanks for the QSFP56 ports, NVIDIA, by the way.

This “tool call format broke” issue is driver-based. It was worst in earlier 580 versions. It’s regression-safe, so if you’re not just chatting but actually using tool calls, it will eventually hit the point where the tooling breaks. Tested with 25.x and 26.01 nvidia images.

Really appreciate the driver insight, @flash3. I was blaming the tool call breakage on SGLang’s conversation template accumulation, but if you’re seeing it across 580.x, 25.x and 26.01 containers and it’s regression-safe, that’s a different layer entirely. Saves me from chasing the wrong problem.
One question — is this specifically CUDA/cuDNN side or NCCL/communication related? On my 4-node TP=4 setup, tool call generation is on the head node but token generation is distributed, so I’m curious if sharding adds another dimension to this.
And “made for people who like to be challenged” — honestly the best description of the Spark I’ve heard. The 101KB shared memory constraint alone taught me more about inference than a year on comfortable hardware. Agreed on the QSFP56 ports too — that’s what turns 4 separate boxes into an actual cluster.

At the moment the tool call breaking is a black box for me. I tried updating the driver first, which led to reduced occurrence. I’ve seen that 590 was released briefly for DGX Spark, so maybe it’s gone after that update? If not, I’ll investigate further.

I have an RTX PRO 6000 setup as a contrast to the DGX Spark. I’ve learned that from NVIDIA’s perspective, the Spark sits closer to the 5000 series architecture — but I’m not buying any 5xxx just to reproduce the same bug. On the RTX PRO 6000, same setup, same model, no problems. That’s my marker where everything points back to the Spark specifically.

And yes, it’s likely fixable, but I don’t plan to put my hands on the driver.


And finally — you only learn when it hurts. Although that might sound strange and like I’m blaming NVIDIA, it’s really just extracting the truth out of this situation. And to be honest: I’ve just ordered my second Spark. So the product and sales team got it right.

Guys, no offense, but this discussion reads like a conversation between two chatbots (and for the most part, it seems that's the case). There is a mix of original thoughts and hypotheses generated by an LLM, and that LLM part is more often wrong than right (case in point: the formulas).

Could you all please not blindly trust LLM output, and digest/read/verify/edit it first before posting? Otherwise it is a disservice to the community. It will also add to the "dead internet" and degrade the quality of future LLMs. Can we do it at least here?

Everything I post is 100% me. Even when I use an LLM to help me in research, I check the logic and verify/test it myself first. When it comes to a frontier field like this, the LLM is wrong most of the time, so take its output with a grain of salt.


There are some novel thoughts here, but it does read like moltbook.

Sure, polishing is always nice. The ideas are not from an LLM, and neither are the results. Feel free to discuss the material, not the form. If something is wrong, do not hesitate…

I have been thinking about experts from the context of performance.

Specifically, since limited memory bandwidth and kernel launch overhead are both limiting factors, if some elements of the expert calculation could be reused (stay resident) between calculations, it could unlock extra performance.

There are more direct optimizations that have a clear path like playing with tile shapes, epilogue optimizations, and fusing kernels that I’m working on right now.

I’m interested in your thought process though. What is your aim?

In fact, one of the limitations is memory bandwidth. I do not want to repeat all of the calculations: VRAM-to-GPU traffic, mostly weights. The idea is to either reduce the amount of VRAM-to-GPU traffic or to distribute the process over different nodes, so the bandwidth aggregates.

Problems: how could someone skip loading weights if every weight is important for a proper calculation? This is where statistics help when more than one token is processed (f.i. dflash). The node distribution introduces latency and another datapath; as long as the new datapath does not introduce new limitations, the additional (network) latency could be overcompensated if enough tokens can be processed in parallel (TP or EP). In my opinion the DGX Spark is strongly memory-bound. A lot of members of this forum argue that it's not. So what do you think? Is one DGX Spark fast enough for your model? Do you use large contexts?
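The bandwidth-aggregation argument in a nutshell: a roofline-style ceiling where every decoded token streams the active weights from memory once. The parameter count and bytes/param below are rough assumptions for gpt-oss-120b, purely illustrative:

```python
def decode_ceiling_tok_s(bandwidth_gb_s, active_params_billions, bytes_per_param):
    """Upper bound on decode speed when memory bandwidth is the only limit:
    each generated token must stream all active weights from memory once."""
    active_gb = active_params_billions * bytes_per_param
    return bandwidth_gb_s / active_gb

# assumed: ~5.1B active params at ~0.6 bytes/param (mxfp4 plus overhead)
one_node = decode_ceiling_tok_s(273, 5.1, 0.6)
two_node = decode_ceiling_tok_s(2 * 273, 5.1, 0.6)
assert abs(two_node - 2 * one_node) < 1e-9  # aggregating bandwidth doubles the ceiling
```

Measured single-request numbers land well below this ceiling, which is where kernel launch overhead and the extra datapath come in.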

I'm disappointed with the current performance and just want more. I share experiences and ideas because I'm new to the DGX Spark. If you have working solutions, I'm glad to learn about them. The earlier the DGX provides enough tokens/s, the earlier I retire here.

Let me give you an example. This project supports TP or EP. According to members of the forum, EP has no effect. So they (vLLM) must be wrong. Or it depends. But if it depends, why not share experiences?

I say it depends. The way I understand it, EP is the most useful when you use data parallelism combined with tensor parallelism, like in “big” cluster, where you use DP between nodes and TP within multiple GPUs on a node.

I ran some tests before and didn't see any performance improvements; actually, performance decreased and memory utilization across nodes became more uneven. Granted, I didn't try EPLB (I believe it is a new addition), so maybe that changes things. It was also back in November, so it may be worth revisiting.

This is from my notes (it was before I made llama-benchy, so using vllm bench serve here):

Running gpt-oss-120b on 2 node cluster (using standard vllm build, that was before optimized mxfp4 container):

with --expert-parallel

============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Benchmark duration (s):                  20.45
Total input tokens:                      1364
Total generated tokens:                  1921
Request throughput (req/s):              0.49
Output token throughput (tok/s):         93.95
Peak output token throughput (tok/s):    156.00
Peak concurrent requests:                10.00
Total Token throughput (tok/s):          160.66
---------------Time to First Token----------------
Mean TTFT (ms):                          246.84
Median TTFT (ms):                        264.44
P99 TTFT (ms):                           265.60
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          44.83
Median TPOT (ms):                        46.59
P99 TPOT (ms):                           58.16
---------------Inter-token Latency----------------
Mean ITL (ms):                           34.55
Median ITL (ms):                         33.33
P99 ITL (ms):                            58.65
==================================================

============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  2.46
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.41
Output token throughput (tok/s):         48.38
Peak output token throughput (tok/s):    50.00
Peak concurrent requests:                1.00
Total Token throughput (tok/s):          53.26
---------------Time to First Token----------------
Mean TTFT (ms):                          86.36
Median TTFT (ms):                        86.36
P99 TTFT (ms):                           86.36
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          20.11
Median TPOT (ms):                        20.11
P99 TPOT (ms):                           20.11
---------------Inter-token Latency----------------
Mean ITL (ms):                           20.11
Median ITL (ms):                         19.99
P99 ITL (ms):                            21.72
==================================================

without --expert-parallel

============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Benchmark duration (s):                  22.21
Total input tokens:                      1364
Total generated tokens:                  2677
Request throughput (req/s):              0.45
Output token throughput (tok/s):         120.50
Peak output token throughput (tok/s):    183.00
Peak concurrent requests:                10.00
Total Token throughput (tok/s):          181.90
---------------Time to First Token----------------
Mean TTFT (ms):                          253.60
Median TTFT (ms):                        271.53
P99 TTFT (ms):                           273.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          41.81
Median TPOT (ms):                        43.28
P99 TPOT (ms):                           53.50
---------------Inter-token Latency----------------
Mean ITL (ms):                           33.71
Median ITL (ms):                         31.08
P99 ITL (ms):                            56.08
==================================================

============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  2.25
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.44
Output token throughput (tok/s):         52.86
Peak output token throughput (tok/s):    54.00
Peak concurrent requests:                1.00
Total Token throughput (tok/s):          58.19
---------------Time to First Token----------------
Mean TTFT (ms):                          85.98
Median TTFT (ms):                        85.98
P99 TTFT (ms):                           85.98
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          18.35
Median TPOT (ms):                        18.35
P99 TPOT (ms):                           18.35
---------------Inter-token Latency----------------
Mean ITL (ms):                           18.35
Median ITL (ms):                         18.20
P99 ITL (ms):                            20.67
==================================================

From my profiling runs, I have gpt-oss-120b at about 93% of the theoretical memory-bandwidth bound. The other 7% is kernel launches.

The GPU itself doesn’t do much work, it’s all in the memory bandwidth.