Why 273 GB/s? Less Is More, Until It Isn’t

Over the past decade, several well-established research directions have argued that “less is more”:

  1. Regularization theory shows that constraining models often improves generalization. Techniques such as weight decay, dropout, and early stopping deliberately reduce effective capacity—and frequently lead to better real-world performance.

  2. The Lottery Ticket Hypothesis demonstrates that large neural networks contain much smaller subnetworks that can match or even outperform the full model, provided they are trained under the right conditions.

  3. Sparse activation and low-precision inference studies suggest that only a fraction of parameters are active per token, and that 4-bit (or lower) representations can converge without catastrophic loss in model quality.

Taken together, these results are often summarized under a familiar principle: less is more.

What do all these theses have in common—and what do they have in common with the DGX Spark?

They are increasingly used to justify systems where “less” is no longer a choice, but a hard constraint.

You can almost picture the product and marketing teams sitting together at a card table, betting each round that sparsity, quantization, or theory will compensate for one more cut.

In the end, the result has to be polished and framed carefully so that it still looks like a coherent product.

The practical outcome, however, is a machine that struggles to run large models—not because the parameters don’t fit, but because memory bandwidth becomes the dominant bottleneck. Agent-style workloads with context barely survive the prefill phase and then drift, not due to emergent behavior, but because execution slows down enough to break the flow.

It’s a bit like riding a bicycle: As long as you stay in motion, you keep your line. Slow down too much, and you fall over.

Of course, one can always say: “They must have had a reason.” And yes, they did. Just often a very different one from what practitioners actually need.

Whether this is a valid interpretation of “less is more” remains an open question. Or perhaps the DGX Spark is simply a glimmer of hope: that these theories will hold up in practice, and that the combined ingenuity, workarounds, and curiosity of its users will turn necessity into a virtue through self-discovery.

You can’t start a fire without a spark—but can you with the DGX Spark?

Not sure this is content suited for dev forums.


Sure — the post tries to answer in advance. A simple question would
indeed fit much better. So, do you know why?

LLM Inference Speed Formula: Finding the Sweet Spot …

The Formula

tok/s = β / (W + KV)

| Variable | Description |
|----------|-------------|
| β  | Memory bandwidth (GB/s) |
| W  | Active weights = P_active × bits / 8 (GB) |
| KV | KV cache = context_length × k (GB) |

That’s it. Everything else is just plugging in numbers.
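The formula drops straight into code. A minimal sketch (the function name and defaults are my own):

```python
def tokens_per_second(bandwidth_gbps, active_params_b, bits=4, kv_gb=0.0):
    """Rough decode-speed ceiling: each generated token streams all
    active weights plus the KV cache through memory once."""
    w_gb = active_params_b * bits / 8  # active weights in GB
    return bandwidth_gbps / (w_gb + kv_gb)
```

For the DGX Spark, `tokens_per_second(273, 14)` gives ~39 tok/s with an empty context, and `tokens_per_second(273, 3, kv_gb=25.8)` lands right at the 10 tok/s target.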

DGX Spark Analysis (273 GB/s)

Let’s find the sweet spot for NVIDIA’s DGX Spark with 273 GB/s unified memory bandwidth.

Target: 10 tok/s (minimum usable for interactive work)

Rearranging the formula:

W + KV = β / tok/s
W + KV = 273 / 10 = 27.3 GB

Sweet Spot Table (Q4 quantization)

| Active Params | W (Q4) | Remaining for KV | Max Context @ 10 tok/s |
|---------------|--------|------------------|------------------------|
| 3B  | 1.5 GB | 25.8 GB | ~400K tokens |
| 7B  | 3.5 GB | 23.8 GB | ~300K tokens |
| 14B | 7 GB   | 20.3 GB | ~180K tokens |
| 32B | 16 GB  | 11.3 GB | ~70K tokens |
| 70B | 35 GB  | -7.7 GB | ❌ impossible |
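The context column can be reproduced by solving the budget for context length. The per-token KV size varies by model and attention backend; the ~64 KB/token default below is a placeholder assumption, not a measured value:

```python
def max_context_tokens(bandwidth_gbps, target_tok_s, active_params_b,
                       bits=4, kv_bytes_per_token=64 * 1024):
    """Context length that still fits the per-token byte budget
    after the active weights are accounted for."""
    budget_gb = bandwidth_gbps / target_tok_s        # GB streamed per token
    kv_gb = budget_gb - active_params_b * bits / 8   # what's left for KV
    if kv_gb <= 0:
        return 0  # active weights alone blow the budget (the 70B row)
    return int(kv_gb * 1e9 / kv_bytes_per_token)
```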

The Convergence Point

For DGX Spark at 10 tok/s usable speed:

Max active params (Q4) ≈ 27B with zero context

The sweet spot: ~14B active parameters (Q4) with ~128K context

This means MoE models like:

  • GLM-4.7-Flash (30B total, 3B active) → excellent fit, 400K+ context possible
  • Mixtral 8x22B (141B total, ~39B active) → too large
  • Qwen2.5-32B (dense) → borderline, ~50K context max

Is 273 GB/s a Convergence Marker?

Looking at available hardware:

| Device | Bandwidth | Price | Sweet Spot (Q4, 10 tok/s) |
|--------|-----------|-------|---------------------------|
| MacBook Pro M3 Max | 400 GB/s | $3,500 | ~40B active, 100K ctx |
| DGX Spark | 273 GB/s | $3,000 | ~27B active, 70K ctx |
| Mac Studio M2 Ultra | 800 GB/s | $4,000 | ~80B active, 200K ctx |

273 GB/s seems to be the minimum viable bandwidth for serious local LLM inference with long context.

Below this (DDR5 at 140-280 GB/s), you’re stuck with either:

  • Small models only, or
  • Unusable speeds at long context

Above this (800+ GB/s), you enter “local GPT-4 class” territory.


Real-World Validation: 4x DGX Spark Cluster vs RTX 3090 Cluster — Why End-to-End Data Path Wins

Great analysis @flash3. Your formula perfectly captures what I’ve been experiencing in practice. Let me add some real-world data from my 4x DGX Spark cluster running GLM-4.7 FP8.

My Setup

  • 4x DGX Spark (512 GB total unified memory)

  • MikroTik CRS812-8DS-2DQ-2DDQ-RM switch

  • ConnectX-7 RDMA interconnect between nodes

  • Running GLM-4.7 FP8 — getting 25 tok/s on single calls, consistently

The Bandwidth Trap Everyone Falls Into

When I was evaluating hardware, everyone — including AI assistants — told me “just get 8x RTX 3090s, you’ll have 7.5 TB/s total memory bandwidth vs your pathetic 273 GB/s per node.”

On paper, they’re right. A single RTX 3090 has 936 GB/s. But here’s what paper specs don’t tell you:

GPU VRAM (936 GB/s)
    ↓
PCIe 4.0 x16 (32 GB/s)      ← 29x bottleneck
    ↓
System RAM
    ↓
NIC buffer
    ↓
Network switch               ← another bottleneck
    ↓
Reverse path to next node

That 936 GB/s becomes ~5-10 GB/s effective end-to-end in a multi-node RTX cluster.

Meanwhile, DGX Spark:

Unified Memory (273 GB/s, zero transfer overhead)
    ↓
ConnectX-7 RDMA (CPU bypass, 25 GB/s)
    ↓
Clean switch (microsecond latency)
    ↓
Direct to next node

273 GB/s stays ~20-23 GB/s effective end-to-end.
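The argument reduces to a toy model: an end-to-end path delivers at most its slowest hop. The hop values below are illustrative numbers from the diagrams above, not measurements:

```python
def effective_gbps(*hops):
    """A chain of links is capped by its slowest link."""
    return min(hops)

# Illustrative hop values (GB/s) from the diagrams above
rtx_cluster = effective_gbps(936, 32, 8)  # VRAM, PCIe 4.0 x16, network
dgx_spark = effective_gbps(273, 25)       # unified memory, ConnectX-7 RDMA
```

The 936 GB/s headline number never survives the PCIe and network hops, while the Spark's path is capped only by its 25 GB/s interconnect.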

The Real Metric: End-to-End Data Path Efficiency

| Metric | 4x DGX Spark | 18x RTX 3090 (3 nodes) |
|--------|--------------|------------------------|
| Spec bandwidth (per unit) | 273 GB/s | 936 GB/s |
| Effective end-to-end | ~20-23 GB/s | ~5-10 GB/s |
| Efficiency ratio | ~8% | ~0.8% |
| tok/s (GLM-4.7 FP8) | 25 | ~8-12 (estimated) |
| Power consumption | ~400W | ~6,300W |
| Noise level | Office-quiet | Separate room required |
| Total cost | $16,000 | ~$31,700 |
| Annual electricity (24/7) | ~$350 | ~$5,500 |

Why the Switch Matters

I’m using a MikroTik CRS812-8DS-2DQ-2DDQ-RM (CRS812 DDQ). Most people overlook switch selection, but it’s critical:

  • 2x 400G QSFP56-DD + 2x 200G QSFP56 + 8x 50G SFP56 — massive headroom

  • Marvell 98DX7335 switch chip — hardware-level cut-through switching, no software overhead

  • Non-blocking architecture — every port sustains full throughput simultaneously

  • Microsecond-level latency — enterprise switches are often overengineered with unnecessary latency

  • Dual hot-swap PSU + 4x hot-swap fans — production-grade reliability

  • $1,295 — a fraction of comparable enterprise 400G switch pricing

With ConnectX-7 doing RDMA (bypassing CPU entirely) into a non-blocking switch with up to 400G capacity, the inter-node communication is as close to wire speed as you can get. And when your cluster grows, you have 400G uplink headroom waiting — no forklift upgrade needed.

Confirming Your Convergence Theory

Your observation that 273 GB/s is a “minimum viable bandwidth” is spot on. But I’d add a nuance:

It’s not just about bandwidth — it’s about how much of that bandwidth actually reaches the compute.

  • 273 GB/s with 100% availability (unified memory) > 936 GB/s with ~1% effective delivery (PCIe + network bottlenecks)

  • The DGX Spark’s “weakness” (LPDDR5x bandwidth) is actually irrelevant for MoE models where active parameters are small

  • ConnectX-7 with RDMA ensures inter-node transfers don’t waste cycles on CPU overhead

Bottom Line

Stop comparing spec sheets. Start tracing the actual data path from weight storage to compute and back. The system with the cleanest, lowest-latency end-to-end path wins — not the one with the biggest number on the box.

For MoE inference at scale, 4x DGX Spark is not a compromise. It’s the engineering-optimal solution.


The shared memory (smem) in this chip is very fast, but there is only ~100 KB of it. In my limited experience, most of the time is spent loading from memory to make data available to the processor… about 99% of the time.

In my project, some of my short-term goals are finding ways to manage that (and kernel overhead) to improve performance on our devices.

The number of available registers and the amount of smem make it hard for me to pipeline two requests at the same time, as is possible on bigger hardware.

Very good. I like the Marvell chips in MikroTik too.

And yes, you’re right about making the thesis more specific. If you could load-balance active MoE experts, or even the affected weights, across devices, this could unlock superscalar inference on the DGX.

But no, this isn’t about reading specs – it’s a report of disappointment. I was surprised how slow Claude Code with GLM-4.7-Flash actually is. The problem: tooling workflows hit 17K+ context, and at 273 GB/s, that’s ~1 tok/s. The DGX Spark is built for chat, not agentic workloads. The RTX Pro 6000 looks like it’s from another world in comparison.

… next thesis:

Hot Expert Replication for Multi-DGX MoE Inference

The Problem

MoE models like GLM-4.7 have 64+ experts, but only 2-4 are activated per token. The activation is not uniformly distributed – some experts are called far more frequently than others.

Expert Activation Frequency (typical):

Expert 3:  ████████████████████████ 12%
Expert 7:  ██████████████████████   11%
Expert 12: ████████████████████     10%
Expert 15: ██████████████████       9%
Expert 1:  ████████████             6%
...
Expert 58: █                        0.3%
Expert 61: █                        0.2%

The 80/20 rule applies: ~20% of experts handle ~80% of activations.

What is Hot Expert Replication?

Core idea: Frequently used “hot” experts are replicated on all nodes. Only rarely used “cold” experts are sharded.

Without Replication (Naive Sharding)

┌─────────────┐         ┌─────────────┐
│   DGX 0     │         │   DGX 1     │
│             │         │             │
│ Expert 0-31 │◄─QSFP56─▶│ Expert 32-63│
│             │         │             │
└─────────────┘         └─────────────┘

Token needs Expert 3 and Expert 45:
- Expert 3:  Local on DGX 0 ✓
- Expert 45: Remote on DGX 1 ✗ → Network transfer required!

Problem: ~50% of activations require remote access.

With Hot Expert Replication

┌─────────────┐         ┌─────────────┐
│   DGX 0     │         │   DGX 1     │
│             │         │             │
│ Hot: 0-15   │◄─QSFP56─▶│ Hot: 0-15   │  ← Replicated!
│ Cold: 16-39 │         │ Cold: 40-63 │  ← Sharded
│             │         │             │
└─────────────┘         └─────────────┘

Token needs Expert 3 and Expert 45:
- Expert 3:  Hot → Local on both ✓
- Expert 45: Cold → Remote on DGX 1 ✗

But: Expert 3 is called 50× more often than Expert 45!
→ 90%+ of activations stay local.

Why This Works

The Math

Assumptions:
- 64 experts total
- 16 hot experts (25%) → replicated on all nodes
- 48 cold experts (75%) → evenly sharded
- Hot experts cover 80% of activations

With 2 nodes:
- Hot expert activation: 80% → always local
- Cold expert activation: 20% → 50% chance remote = 10% remote

Result: Only 10% remote calls instead of 50%!

Memory Overhead

Without replication (2 nodes):
- Each node: 32 experts = 90 GB (Q4)

With hot replication (2 nodes):
- Each node: 16 hot + 24 cold = 40 experts = 112 GB (Q4)

Overhead: +24% memory for 5× less network traffic
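Both numbers above fall out of two small helpers (the function names are mine):

```python
def per_node_experts(total_experts, hot_experts, nodes):
    """Experts stored per node: the hot set is replicated everywhere,
    the cold set is sharded evenly."""
    cold = total_experts - hot_experts
    return hot_experts + cold // nodes

def remote_fraction(hot_coverage, nodes):
    """Share of activations that cross the network: only cold
    activations, and only when the shard lives on another node."""
    return (1.0 - hot_coverage) * (nodes - 1) / nodes
```

`per_node_experts(64, 16, 2)` gives the 40 experts per node from the overhead estimate, and `remote_fraction(0.8, 2)` reproduces the 10% remote-call rate.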

Why 1× or 2× QSFP56?

Bandwidth Analysis

| Interconnect | Bandwidth | Latency |
|--------------|-----------|---------|
| 1× QSFP56 | 25 GB/s | ~1-2 µs |
| 2× QSFP56 (bonded) | 50 GB/s | ~1-2 µs |
| NVLink (comparison) | 900 GB/s | ~300 ns |

Scenario: GLM-4.7 with 2 Nodes

Hidden state transfer (pipeline):

Size: 8 KB per token
1× QSFP56: 8 KB / 25 GB/s = 0.3 µs ✓ Negligible

Cold expert remote call:

Send activation: ~64 KB
Receive expert output: ~64 KB
Total: ~128 KB per remote expert call

1× QSFP56: 128 KB / 25 GB/s = 5 µs
2× QSFP56: 128 KB / 50 GB/s = 2.5 µs

Worst case (without hot replication):

2 experts per layer × 60 layers × 50% remote = 60 remote calls

1× QSFP56: 60 × 5 µs = 300 µs per token
2× QSFP56: 60 × 2.5 µs = 150 µs per token

With hot replication (10% remote):

2 experts per layer × 60 layers × 10% remote = 12 remote calls

1× QSFP56: 12 × 5 µs = 60 µs per token ✓
2× QSFP56: 12 × 2.5 µs = 30 µs per token ✓
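The per-token network time generalizes to a small helper (assuming, as above, ~128 KB per remote call and 25 GB/s per QSFP56 port):

```python
def remote_latency_us(layers, experts_per_layer, remote_frac,
                      payload_bytes=128_000, link_gbps=25):
    """Per-token time spent on remote expert calls."""
    calls = layers * experts_per_layer * remote_frac
    per_call_s = payload_bytes / (link_gbps * 1e9)  # one round trip
    return calls * per_call_s * 1e6                 # microseconds
```

Naive sharding (`remote_latency_us(60, 2, 0.5)`) costs roughly 300 µs per token; hot replication cuts the remote fraction to 0.1 and the cost by 5×.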

Interconnect Recommendation

| Scenario | 1× QSFP56 | 2× QSFP56 | Recommendation |
|----------|-----------|-----------|----------------|
| Pipeline only | ✓ Sufficient | Overkill | |
| Expert sharding (naive) | ✗ Too slow | ✗ Too slow | Not feasible |
| Hot replication | ✓ Good | ✓✓ Better | 1× sufficient, 2× for headroom |

1× QSFP56 is sufficient when hot expert replication is implemented. 2× QSFP56 provides headroom for:

  • Larger models
  • More nodes (3-4 DGX)
  • Burst traffic from uneven activation patterns

Required Software Stack Changes

Current State: vLLM / SGLang

# Current MoE implementation (simplified)
class MoELayer:
    def forward(self, hidden_states):
        router_logits = self.router(hidden_states)
        routing_weights, expert_indices = router_logits.topk(k=2)

        # All experts computed locally; no concept of placement
        output = torch.zeros_like(hidden_states)
        for weight, expert_id in zip(routing_weights, expert_indices):
            output = output + weight * self.experts[expert_id](hidden_states)

        return output

Problem: No concept of “hot” vs “cold” experts, no distributed expert management.

Required Changes

  1. Expert Profiling & Classification

  2. Distributed Expert Manager

  3. RDMA Communicator for QSFP56

  4. Distributed MoE Layer
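The heart of changes (2) and (4) is a placement-aware dispatch decision. A hypothetical sketch, not the vLLM API:

```python
class HotColdDispatcher:
    """Decides whether an expert call stays local under hot/cold
    placement (hypothetical helper, not an existing vLLM class)."""

    def __init__(self, hot_ids, local_cold_ids):
        self.hot = set(hot_ids)                # replicated on every node
        self.local_cold = set(local_cold_ids)  # cold shard owned locally

    def is_local(self, expert_id):
        return expert_id in self.hot or expert_id in self.local_cold

# DGX 0 from the diagram above: hot 0-15 replicated, cold 16-39 local
dgx0 = HotColdDispatcher(hot_ids=range(16), local_cold_ids=range(16, 40))
```

Expert 3 resolves locally (`dgx0.is_local(3)` is True); expert 45 would trigger an RDMA round trip.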

… not that far …

Summary: Where Does Multi-DGX Get Faster?

The speedup comes from aggregated memory bandwidth, not the network.

| Setup | RAM Bandwidth | tok/s @ 17K |
|-------|---------------|-------------|
| 1× DGX Spark | 273 GB/s | ~1 |
| 2× DGX Spark | 546 GB/s | ~2 |
| 3× DGX Spark | 819 GB/s | ~3 |

Pipeline Parallelism: Each node processes only its layers → half the data per node → 2× effective bandwidth.

Hot Expert Replication: Keeps 90% of expert activations local → prevents network from becoming the bottleneck.

QSFP56: Only carries the “baton” (8 KB hidden states between layers) + rare cold expert calls.

Speedup = Nodes × (1 - Network Overhead)

With hot replication:     2× DGX = 2 × 0.95 = 1.9× faster
Without hot replication:  2× DGX = 2 × 0.50 = 1.0× (no gain)

Multi-DGX scales memory bandwidth. Hot replication ensures the network doesn’t kill that scaling. QSFP56 is just the relay – the race is won in RAM.

Hi @flash3,

The locality intuition makes sense, but the scaling conclusion seems to rest on two assumptions that need validation:

  • “Hot expert” stability: Most MoE routers enforce load balancing and capacity limits, and expert usage can vary strongly with prompt/domain. Maybe provide routing histograms showing a stable top-N expert set across realistic inference workloads?

  • EP communication model: The EP model, as I understand it, seems to assume many small remote expert calls, whereas most implementations batch tokens per layer into grouped all-to-all exchanges.

Let me know your thoughts

I agree on both points. With some distance, all-to-all (batched) seems feasible and is obviously simpler to implement.

And yes, activation behavior deserves more research (a prediction model?). A lightweight predictor based on histograms seems like a solid idea – different usage patterns lead to different hot/cold classifications and different distributions.

Those who prompt well use bandwidth best. I think this approach beats leaving everything to chance.

And the best part: even a dumb predictor can’t hurt – it can only fail.

The best predictor patterns might come directly from the model trainers.

During training, they track router behavior for load balancing:

  • Expert activation frequencies across millions of samples
  • Activation patterns per domain (code/math/chat)
  • Expert co-activation correlations
  • Routing stability metrics

They have all the data. They could publish it.

A very simple addition to model cards:

{
    "model": "GLM-4.7",
    "expert_profiles": {
        "code": {"hot": [3, 7, 12, 15], "coverage": 0.82},
        "chat": {"hot": [5, 9, 14, 31], "coverage": 0.78},
        "math": {"hot": [2, 8, 15, 27], "coverage": 0.85}
    },
    "domain_overlap": 0.45
}
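A runtime could then pick its hot set per workload in a few lines. The `expert_profiles` field is the proposal above, not an existing model-card standard:

```python
profile = {
    "model": "GLM-4.7",
    "expert_profiles": {
        "code": {"hot": [3, 7, 12, 15], "coverage": 0.82},
        "chat": {"hot": [5, 9, 14, 31], "coverage": 0.78},
        "math": {"hot": [2, 8, 15, 27], "coverage": 0.85},
    },
}

def hot_set(profile, domain, fallback="chat"):
    """Hot experts for a workload domain, with a default profile
    for domains the trainer never measured."""
    profiles = profile["expert_profiles"]
    return set(profiles.get(domain, profiles[fallback])["hot"])
```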

Well, the KV-cache calculation is much more complex than that and depends on the model architecture/topology and the attention backend. Also, MoE models will perform slightly worse than equivalent dense models with the same active parameter count, because there is some extra overhead. But as a rule of thumb, it’s good enough, apart from the KV calculation.

summarizing…

Holding experts vs. executing experts

| Strategy | Memory cost | Execution parallelism | Bottleneck |
|----------|-------------|-----------------------|------------|
| (1) Full replication | N × experts per node | Perfect: each node executes 1/N of active experts | None – guaranteed balance |
| (2) Frequency-based distribution | Partial replication of hot experts | Depends on prediction accuracy | Node with most concurrent experts |
| (3) Unique sharding | Minimal – each expert exists once | Pure luck | Worst case: all active experts on one node |

The planning formula:

Max parallelism = number of concurrent active experts

Strategy (1): Parallelism = active_experts × cluster_nodes
Strategy (2): Parallelism = f(prediction_accuracy, replication_factor)
Strategy (3): Parallelism = 1 to active_experts (random)

For capacity planning:

If a model activates k experts per token out of E total, and you have N nodes:

  • (1) Full replication: Each node holds all E experts. Each node executes k/N experts per token. Linear scaling. Memory cost: N × E.

  • (2) Hot/cold split: Hot experts (say top 20%) replicated everywhere, cold experts sharded. Works if hot experts cover most activations. Memory cost: N × 0.2E + 0.8E.

  • (3) Unique sharding: Each expert on exactly one node. Memory cost: E total. Execution becomes a queuing problem – if 2 of k active experts land on the same node, that node is 2× loaded.
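The three memory costs in one helper (my naming; the result counts expert copies stored cluster-wide):

```python
def memory_cost_experts(strategy, total_experts, nodes, hot_frac=0.2):
    """Cluster-wide expert copies under the three placement strategies."""
    if strategy == "full":      # every node holds every expert
        return nodes * total_experts
    if strategy == "hot_cold":  # hot share replicated, cold sharded once
        return nodes * hot_frac * total_experts + (1 - hot_frac) * total_experts
    if strategy == "unique":    # each expert exists exactly once
        return total_experts
    raise ValueError(f"unknown strategy: {strategy}")
```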

For luck planning:

  • it’s always Murphy’s law.

Example:

GLM-4.7 has 8 routed + 1 shared = 9 experts active per token.
GLM-4.7 needs 2 DGX Sparks at 4-bit.

Tradeoff:

| Setup | Nodes | Memory Cost | Latency |
|-------|-------|-------------|---------|
| Full replication | 18 (9×2) | 9× model | Optimal – every expert local |
| Hot/cold split | 6-10 | ~2-3× model | Near-optimal |
| Unique sharding | 2 | 1× model | Luck – sequential if poorly distributed |

Full replication could be scaled down to the number of experts you can actually run in parallel at a useful tok/s. You don’t need 9 node pairs if you can’t use that parallelism.

The real constraint:

Useful parallelism = min(active_experts, throughput_bottleneck)

If your target is, say, 30 tokens/s and a single node pair can already deliver 40 tokens/s with sequential expert execution – then replicating for parallel experts buys you nothing.

So the formula becomes:

Required node pairs = ceil(target_tokens_per_sec / single_pair_throughput)

Not:

Required node pairs = number_of_active_experts

Example:

| Single pair throughput | Target | Node pairs needed |
|------------------------|--------|-------------------|
| 10 tok/s | 30 tok/s | 3 |
| 15 tok/s | 30 tok/s | 2 |
| 40 tok/s | 30 tok/s | 1 |
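The sizing rule as code (the throughput numbers are the hypothetical ones from the table):

```python
import math

def node_pairs_needed(target_tok_s, single_pair_tok_s):
    """Size by throughput, not by expert count: extra pairs beyond
    this ceiling add parallelism you cannot consume."""
    return math.ceil(target_tok_s / single_pair_tok_s)
```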

The 9 active experts only matter if expert execution is actually your bottleneck. If memory bandwidth, attention, or network is the limit – more expert parallelism doesn’t help.

I’ve implemented a histogram stack in vLLM. Let’s see who the star in the experts’ hood is…

If you are keeping track of your LLM calls, it could be worth using that data to optimize MoEs. You could also feed it into the Cerebras REAP library, which is often used to cut down total params. You can technically get MiniMax running if you cut 50% of the experts and go with Q4. If you use your own data for expert activation analysis, you can retain most of the performance that is relevant to you.


Clearly, not all candles on the Christmas tree light up equally often. Still too coarse for my taste, though. The next step is adding co-activation tracking to see which experts fire together.
Ideas welcome.
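A minimal sketch of what such co-activation tracking could look like: a plain counter over unordered expert pairs (the expert IDs below are examples):

```python
from collections import Counter
from itertools import combinations

coactivation = Counter()

def record_routing(expert_ids):
    """Count every unordered pair of experts fired for one token."""
    for pair in combinations(sorted(expert_ids), 2):
        coactivation[pair] += 1

record_routing([14, 50, 61])
record_routing([50, 57, 61])
```

`coactivation.most_common(1)` then surfaces (50, 61) as the strongest buddy pair.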

one claude code’s ‘/init’ later … <0.02 => ‘.’

This is very interesting to me. Is your code somewhere?

GLM-4.7-Flash 30B FP8 on DGX, one Claude Code /init and some architectural analysis tasks later…

The distribution is not chaotic, but it is also not strongly clustered — it’s a typical MoE pattern with soft routing:

  • Max/Min = 11× – the strongest pair (50↔61) fires only ~11× more often than the weakest (21↔54). That’s a relatively moderate spread.

  • 81% of pairs are at 10–30% of the max – the bulk is “lukewarm.” There’s no sharp separation into “always together” vs. “never together.”

  • Only 3 pairs above 90%: 50↔61, 14↔57, 4↔6 – true “buddy pairs.”

  • Experts 60, 21, and 54 appear frequently among the weakest pairs → more like “lone wolves.”

Conclusion: GLM-4.7-Flash has a fairly well-balanced router. Most experts are combined with many different partners. There are a few preferred pairs, but no strong isolated clusters. For EPLB (Expert Load Balancing), this means that buddy pairs like 50↔61 should be placed on different nodes, but overall the load is well distributed.

It’s much more about splitting the hot buddies than about identifying hot vs. cold experts only.

Thesis: All to all, but well placed:

| Nodes | Cross-Node Score | Top 10 split | Top 20 split | Top 50 split |
|-------|------------------|--------------|--------------|--------------|
| 2     | 53.9%            | 4 / 10       | 12 / 20      | 33 / 50      |
| 4     | 80.5%            | 10 / 10      | 20 / 20      | 48 / 50      |

With 4 nodes, all Top-20 buddy pairs are placed on different nodes — ideal for parallelization.
With 2 nodes, it is mathematically impossible to separate all of them due to too many dependencies.
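One way to do the placement is a greedy pass over pairs, strongest first: put each expert of a pair on the least-loaded node that doesn’t already hold its buddy. A sketch (the pair list is from the buddy pairs above):

```python
def place_experts(buddy_pairs, num_nodes):
    """Greedy split: for each pair (strongest co-activation first),
    place unplaced experts on the least-loaded node, avoiding the
    buddy's node when possible."""
    node_of = {}
    load = [0] * num_nodes
    for a, b in buddy_pairs:
        for expert, buddy in ((a, b), (b, a)):
            if expert in node_of:
                continue
            taken = node_of.get(buddy)  # buddy's node, if already placed
            candidates = [n for n in range(num_nodes) if n != taken]
            best = min(candidates or range(num_nodes), key=lambda n: load[n])
            node_of[expert] = best
            load[best] += 1
    return node_of

placement = place_experts([(50, 61), (14, 57), (4, 6)], num_nodes=2)
```

With 2 nodes this separates the three strong pairs; separating all Top-20 pairs generally needs more nodes, matching the cross-node scores above.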

hmm …

it will be public after some cleanups.

| Scenario              | Tier 1 4/4        | Tier 2 4/4        | Replicas | VRAM Overhead |
|-----------------------|-------------------|-------------------|----------|---------------|
| k=2 base              | 46/46 (2/2)       | 90/92 (2/2)       | 0        | 0%            |
| k=2 --multi --max 48  | 46/46 (2/2)       | 92/92 (2/2)       | 25       | +39%          |
| k=4 base              | 32/46             | 47/92             | 0        | 0%            |
| k=4 --multi --max 24  | 40/46 (87%)       | 68/92 (81%)       | 32       | +50%          |
| k=4 --multi --max 32  | 42/46 (91%)       | 80/92 (89%)       | 59       | +92%          |

Key Takeaways:

  • k=2 + 25 replicas: All 92 tuples are perfectly split 2:2.
    25 duplicated experts mean 39% more VRAM, but 100% parallelization.

  • k=4 + max 32:
    91% of Tier-1 tuples achieve 4/4, the rest still 3/4 (never worse).
    However, this costs almost double the VRAM (59 replicas).

  • k=4 + max 24 is the sweet spot:
    87% perfect at only 50% VRAM overhead.

k = DGX nodes

max = maximum experts per node

Tiers: Tier 1 > 90%, Tier 2 > 20% (see distribution graph)

1:1:1:1 means each node runs 1 expert per tuple; GLM-4.7-Flash uses 4 experts.
2:1:1 means one node has to run 2 experts per token.

And this is with the hot/cold approach, splitting the hot buddies…

(⌐■_■)

next step: more use cases. more prompts and longer tasks.