Why 273 GB/s? Less Is More, Until It Isn’t

Some progress has been made in breaking up the monolithic vLLM architecture.

[FP4 on DGX Spark — Why It Doesn't Scale Like You'd Expect - #201 by flash3]

The main issues: compute and memory feeding can't be separated well, because all of the compute units are fed from the same memory (and share its limited bandwidth).
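As a back-of-the-envelope check of that bandwidth ceiling (numbers are illustrative: the ~273 GB/s from the title, and a 30B-A3B model at INT4 as mentioned later in the thread):

```python
# Rough decode-speed ceiling from memory bandwidth alone: each generated
# token must stream the active weights through the memory system at least
# once, so bandwidth / bytes_per_token bounds tokens/s.

bandwidth_gb_s = 273          # DGX Spark unified-memory bandwidth (GB/s)
active_params = 3e9           # ~3B active parameters (A3B MoE, assumption)
bytes_per_param = 0.5         # INT4 weights -> half a byte per parameter

bytes_per_token = active_params * bytes_per_param          # ~1.5 GB
ceiling_tokens_s = bandwidth_gb_s * 1e9 / bytes_per_token

print(f"upper bound: {ceiling_tokens_s:.0f} tokens/s")  # -> 182 tokens/s
```

And that is shared with the CPU, the desktop, and everything else on the box.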

Everything that gets distributed or shared between nodes is serialized, so in principle it takes longer than on a single node.

NCCL itself is very fast, but if a remote procedure is called 90 times at 120µs each, ~10.8ms are gone. So ~92 per second is the upper limit for this operation, and every further bit of compute time lowers it.
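The arithmetic in plain form, using the call count and latency from the post:

```python
calls_per_token = 90      # collectives issued per generated token
latency_s = 120e-6        # per-call NCCL round-trip

comm_time = calls_per_token * latency_s   # pure communication per token
max_rate = 1 / comm_time                  # hard ceiling, zero compute

print(f"{comm_time * 1e3:.1f} ms -> at most {max_rate:.1f} tokens/s")
# -> 10.8 ms -> at most 92.6 tokens/s
```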

vLLM is full of this. vLLM is a framework for setting up and running local models (GPUs on the same host). Some extensions try to scale it to other GPUs, and further across the network, but in the end it isn't well orchestrated. It wasn't made for DGX.

DGX is a complementary design to every CUDA/AI approach that came before it.

The first users are trying to sell their DGX because they couldn't figure out how to use it. The dead-horse marketing (NVFP4) doesn't help either.

Just to give you a concrete picture: you know that memory is shared between GPU and CPU. So you run the Linux system with X and a browser on it. Why not have a YouTube video running while you're doing some AI research?

  Rank 0 (DGX Spark)                             Rank 1 (PGX ThinkStation)
  ==================                             ========================

  ┌─────────────────────┐                        ┌─────────────────────┐
  │ input_layernorm     │ ~2µs                   │ input_layernorm     │
  │ (RMSNorm)           │                        │ (RMSNorm)           │
  └─────────┬───────────┘                        └─────────┬───────────┘
            │                                              │
  ┌─────────▼───────────┐                        ┌─────────▼───────────┐
  │ QKV Projection      │ ~50µs (Marlin INT4)    │ QKV Projection      │
  │ (ColumnParallel)    │ NO collective          │ (ColumnParallel)    │
  │ Attention Compute   │ ~30µs                  │ Attention Compute   │
  │ o_proj GEMM         │ ~20µs (RowParallel)    │ o_proj GEMM         │
  └─────────┬───────────┘                        └─────────┬───────────┘
            │ partial_sum_0                                │ partial_sum_1
            │                                              │
  ══════════╪══════════════════════════════════════════════╪════════════
  ║         └──────── NCCL AllReduce #1 (o_proj) ─────────┘           ║
  ║              MEASURED: 18µs raw, 46µs in CUDA Graph               ║
  ║   ┌──────────────────────────────────────────────────┐            ║
  ║   │ RDMA Write:    Rank0 → Rank1 partial_sum_0       │ ~2-4µs    ║
  ║   │ RDMA Write:    Rank1 → Rank0 partial_sum_1       │ ~2-4µs    ║
  ║   │ GPU Reduce:    result = partial_0 + partial_1    │ ~1µs      ║
  ║   │ NCCL Protocol: LL Handshake + Flags              │ ~5-10µs   ║
  ║   │ Graph Replay Overhead (when in CUDA Graph):      │ ~28µs     ║
  ║   └──────────────────────────────────────────────────┘            ║
  ══════════╪══════════════════════════════════════════════╪════════════
            │ reduced = p0 + p1                            │
            │                                              │
  ┌─────────▼───────────┐                        ┌─────────▼───────────┐
  │ post_attn_layernorm │ ~5µs                   │ post_attn_layernorm │
  │ fused_add_rms_norm  │ (residual add + norm)  │ fused_add_rms_norm  │
  └─────────┬───────────┘                        └─────────┬───────────┘
            │                                              │
  ┌─────────▼───────────┐                        ┌─────────▼───────────┐
  │ MoE Router          │ ~5µs (replicated)      │ MoE Router          │
  │ FusedMoE Kernel     │ ~100µs (Marlin INT4)   │ FusedMoE Kernel     │
  │ (RowParallel)       │                        │ (RowParallel)       │
  └─────────┬───────────┘                        └─────────┬───────────┘
            │ partial_sum_0                                │ partial_sum_1
            │                                              │
  ══════════╪══════════════════════════════════════════════╪════════════
  ║         └──────── NCCL AllReduce #2 (MoE out) ────────┘           ║
  ║              MEASURED: 18µs raw, 46µs in CUDA Graph               ║
  ══════════╪══════════════════════════════════════════════╪════════════
            │                                              │
            ▼ → next layer input_layernorm                 ▼

Current measurements… CUDA Graph replay is heavy…
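Tallying the per-layer numbers from the diagram (eager all-reduce at 18µs vs. ~46µs inside a CUDA Graph). The 48-layer count is an assumption for a Qwen3-30B-A3B-class model, not stated in the thread:

```python
# Per-decoder-layer time budget, summed from the measured/estimated
# figures in the diagram above (all values in microseconds, per rank).
layer_us = {
    "input_layernorm": 2,
    "qkv_projection":  50,   # Marlin INT4, no collective
    "attention":       30,
    "o_proj":          20,
    "post_attn_norm":  5,
    "moe_router":      5,    # replicated
    "fused_moe":       100,  # Marlin INT4
}
compute = sum(layer_us.values())          # 212 µs of local work

NUM_LAYERS = 48  # assumed decoder-layer count for a 30B-A3B-class model
for allreduce_us, label in [(18, "eager"), (46, "CUDA Graph")]:
    total = compute + 2 * allreduce_us    # two all-reduces per layer
    print(f"{label}: {total} µs/layer -> "
          f"{1e6 / (NUM_LAYERS * total):.0f} tokens/s over {NUM_LAYERS} layers")
```

The ~28µs of graph-replay overhead per collective, paid twice per layer, is where the "CUDA Graph is heavy" observation comes from.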


Are you using this, the new way to launch vLLM: `--distributed-executor-backend external_launcher`?
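For reference, a sketch of how that backend is typically driven (hostname, port, and script name are placeholders; with `external_launcher`, torchrun rather than Ray owns process placement):

```shell
# Sketch: one vLLM rank per node, launched by torchrun instead of Ray.
# Run on node 0 (e.g. the DGX Spark):
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 \
    --master_addr=spark0.local --master_port=29500 infer.py
# Run the same command on node 1, with --node_rank=1.
#
# infer.py (placeholder) constructs the engine itself, e.g.:
#   LLM(model=..., tensor_parallel_size=2,
#       distributed_executor_backend="external_launcher")
```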

Work in progress, but yes. There are still some blockers, but it finally shows that TP=2 can be faster than TP=1 if you mod vLLM.

An eager-allreduce run showed 198 tokens/s (TP=2, zero context, Qwen3-Coder-30B-A3B INT4 AutoRound, EAGLE3 NST=1), but it's too early to call.

swap-laboratories/moe-configs at main · vedcsolution/swap-laboratories · GitHub: some MoE configurations for Intel_Qwen3.5-122B-A10B-int4-AutoRound, Qwen_Qwen3.5-122B-A10B-FP8 and DGX Spark. The tuned ones ran under Ray; I don't know if they need to be updated for the new launch method, `--distributed-executor-backend external_launcher`.

TP=2 is already faster than TP=1 without modding vLLM. It depends on the model, of course: models with a small number of active parameters don't scale as well as large dense models. But we've been using tensor parallelism to improve inference speeds on Sparks since November…

Sure, it depends. The ~1/2 scaling of per-token compute time competes with the latency and wait cycles around NCCL, and mostly the two compensate; when compute is fully saturated, the time won offsets the time spent waiting.
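That trade-off can be written down directly. A minimal Amdahl-style sketch, with numbers in the range measured earlier in the thread (the 95% sharded fraction is an assumption):

```python
def tp2_speedup(compute_us, parallel_frac, comm_us_per_layer, layers):
    """Estimate TP=2 vs TP=1: the sharded share of per-layer compute is
    halved, the replicated rest is not, and each layer pays a fixed
    communication cost (both all-reduces folded into comm_us_per_layer)."""
    t1 = compute_us * layers
    t2 = (compute_us * (parallel_frac / 2 + (1 - parallel_frac))
          + comm_us_per_layer) * layers
    return t1 / t2

# ~212 µs local work/layer, ~95% of it sharded (assumption),
# 2 x 18 µs eager all-reduce, 48 layers (assumption)
print(f"{tp2_speedup(212, 0.95, 36, 48):.2f}x")  # -> 1.44x
```

Push the communication term toward the CUDA Graph figure (2 × 46µs) and the speedup shrinks accordingly, which is exactly the "won time vs. wait time" competition described above.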