SM121 CUTLASS Kernel Optimization Results: NVFP4 356 TFLOPS, MoE Grouped GEMM on DGX Spark

Hi everyone,

I’ve recently been working on CUTLASS-level kernel optimization for SM121 (GB10) and wanted to share results that might be useful for the DGX Spark community.

Background

SM121 lacks tcgen05, multicast, and 2-SM MMA — so SM100 datacenter kernels don’t run natively. The default fallback is SM80-class kernels, which leave significant performance on the table. I set out to build and benchmark proper SM120/SM121-targeted CUTLASS kernels.

Environment

  • 4x DGX Spark (GB10, 128GB each)
  • CUDA 13.1 + CUTLASS 4.4.0
  • Nsight Compute for profiling
  • 200Gbps RoCE dedicated fabric (Mikrotik CRS812 DDQ)

GEMM Benchmark Results

NVFP4 (Block-Scaled FP4)

| Problem Size | TFLOPS | Notes |
| --- | --- | --- |
| 4096 x 14336 x 4096 | 356 | Peak measured |
| Various MoE shapes | 120-154 | Expert-dependent |

356 TFLOPS on dense NVFP4 = ~71% utilization of the 500 TFLOPS dense FP4 peak (1 PFLOPS spec includes sparsity).
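
For anyone rechecking the arithmetic, here is a minimal sketch of how a TFLOPS figure and the utilization ratio are derived. It assumes the usual 2·M·N·K operation count for a dense GEMM. The helper names are mine, and the kernel time printed at the end is back-computed from the 356 TFLOPS figure, not a separate measurement.

```python
def gemm_tflops(m, n, k, seconds):
    """Dense GEMM throughput in TFLOPS, counting 2*M*N*K operations per call."""
    return 2 * m * n * k / seconds / 1e12


def utilization(measured_tflops, peak_tflops):
    """Fraction of the quoted peak that was actually reached."""
    return measured_tflops / peak_tflops


# Ratio quoted in the post: 356 measured vs. 500 dense FP4 peak.
print(f"utilization: {utilization(356, 500):.1%}")

# Kernel time implied by 356 TFLOPS at the 4096 x 14336 x 4096 problem size.
ops = 2 * 4096 * 14336 * 4096
implied_seconds = ops / 356e12
print(f"implied time per GEMM: {implied_seconds * 1e3:.3f} ms")
```

When benchmarking your own kernels, pass the measured per-call time into `gemm_tflops` and compare against the dense peak, not the sparsity-inflated 1 PFLOPS figure.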

FP8

| Problem Size | TFLOPS |
| --- | --- |
| Peak (large GEMM) | 188 |

MoE Grouped GEMM (8 & 64 experts)

| Tile Shape | N=256 | Best For |
| --- | --- | --- |
| 256x128 | 154 TFLOPS | Prefill / large batch |
| 128x128 | ~147 TFLOPS | Decode / small batch |

Key challenge: GB10 has 101,376 bytes shared memory (same as RTX 4090). SGLang’s default MoE configs request ~147KB and fail with OutOfResources. Custom tile configs solve this.
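
To make the shared memory budget concrete, here is a rough sketch of how a tile config can be screened before launch. It assumes a simplified model in which each pipeline stage holds one A tile and one B tile, with 1-byte elements as for FP8. Real Triton kernels can allocate more than this, so treat it as a first-pass filter only. The candidate configs are hypothetical examples, not SGLang's actual defaults. Under this model the first candidate happens to need 147,456 bytes.

```python
SMEM_LIMIT_BYTES = 101_376  # GB10 shared memory limit quoted above


def estimate_smem_bytes(block_m, block_n, block_k, num_stages, elem_bytes=1):
    """Simplified estimate: one A tile plus one B tile per pipeline stage."""
    a_tile = block_m * block_k * elem_bytes
    b_tile = block_k * block_n * elem_bytes
    return (a_tile + b_tile) * num_stages


def fits(cfg, limit=SMEM_LIMIT_BYTES):
    """True if the estimated footprint stays within the limit."""
    return estimate_smem_bytes(**cfg) <= limit


# Hypothetical candidates for illustration only.
candidates = [
    dict(block_m=128, block_n=256, block_k=128, num_stages=3),
    dict(block_m=128, block_n=256, block_k=128, num_stages=2),
    dict(block_m=128, block_n=128, block_k=128, num_stages=3),
    dict(block_m=64, block_n=128, block_k=128, num_stages=4),
]

for cfg in candidates:
    need = estimate_smem_bytes(**cfg)
    verdict = "fits" if fits(cfg) else "exceeds limit"
    print(f"{cfg} -> {need} bytes ({verdict})")
```

Dropping a pipeline stage or halving one tile dimension is usually enough to get back under the limit; which of the two costs less throughput has to be measured per workload.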

SGLang A/B Test (GLM-4.7-FP8, 4-node TP=4)

| Metric | With Optimized Configs | Without (Default) | Delta |
| --- | --- | --- | --- |
| Throughput | 16.77 tok/s | 15.77 tok/s | +6.3% |
| Stability | ±0.01 | ±0.01 | Both stable |

Note: EAGLE speculative decoding was disabled for this comparison. With EAGLE enabled, throughput is 20-27 tok/s.

What Was Done

  1. Python DSL patch: Added sm_121a to BlockScaledMmaOp.admissible_archs (CUTLASS issue #2800)
  2. NVFP4 + FP8 GEMM kernels: Built via C++ collective builder targeting SM121, TN layout, cluster 1x1x1
  3. MoE tile sweep: Benchmarked tile shapes within 101KB SMEM budget, found optimal configs per workload
  4. SGLang MoE configs: Generated 4 config files for NVIDIA_GB10 (2 per Triton version: 3.3.0 and 3.5.0) — these didn’t exist before
  5. Docker image: Pre-built container with all configs applied
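
For step 4, a minimal sketch of what generating such a config file can look like. It assumes the fused-MoE Triton config layout as I understand it: a JSON object keyed by batch size, each entry holding `BLOCK_SIZE_M`, `BLOCK_SIZE_N`, `BLOCK_SIZE_K`, `GROUP_SIZE_M`, `num_warps` and `num_stages`, in a file named after the expert count, N and device name. Check both the layout and the naming against your SGLang version. The helper names are mine, and every tile value below is a placeholder, not one of the tuned values from the repo.

```python
import json
import tempfile
from pathlib import Path


def config_filename(num_experts, n, device_name, dtype=None):
    """Build a config file name (naming scheme is an assumption, see above)."""
    name = f"E={num_experts},N={n},device_name={device_name}"
    if dtype:
        name += f",dtype={dtype}"
    return name + ".json"


def make_entry(block_m, block_n, block_k, group_m, num_warps, num_stages):
    """One per-batch-size kernel config entry."""
    return {
        "BLOCK_SIZE_M": block_m,
        "BLOCK_SIZE_N": block_n,
        "BLOCK_SIZE_K": block_k,
        "GROUP_SIZE_M": group_m,
        "num_warps": num_warps,
        "num_stages": num_stages,
    }


# Placeholder values keyed by batch size (NOT tuned results).
configs = {
    "1": make_entry(16, 64, 128, 1, 4, 3),
    "64": make_entry(64, 128, 128, 8, 4, 3),
    "1024": make_entry(128, 128, 128, 16, 8, 2),
}

# Write to a temporary directory here; a real setup would target the
# config directory that matches the installed Triton version.
out_path = Path(tempfile.mkdtemp()) / config_filename(64, 256, "NVIDIA_GB10", "fp8_w8a8")
out_path.write_text(json.dumps(configs, indent=2))

loaded = json.loads(out_path.read_text())
print(out_path.name, "->", sorted(loaded, key=int))
```

Every entry should be checked against the 101,376-byte shared memory limit before it is written, since a single oversized entry is enough to trigger OutOfResources for that batch size.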

Deliverables (Open Source)

What Was Skipped (and Why)

  • Attention kernel port (Phase 4): FlashInfer 0.5.3 already works on SM121. CUTLASS attention port from example 77 is very complex with low practical ROI.
  • Multi-node NCCL optimization (Phase 6): Current RoCE/RDMA setup provides adequate performance for 4-node TP=4.

Key Takeaways for GB10 Users

  1. SM121 does have real NVFP4/FP8 Tensor Core capability — 356 TFLOPS is proof
  2. The 101KB shared memory limit is the main constraint, not compute
  3. Default MoE configs from SGLang/vLLM will fail on GB10 — custom configs are required
  4. CUTLASS 4.4.0 with sm_121a target works — you don’t need to wait for upstream fixes

If anyone wants to reproduce or extend this work, everything is in the GitHub repo. Happy to answer questions.


Update: EAGLE Speculative Decoding Requires Tuned MoE Configs on GB10

After publishing the initial results, I ran an additional test to see what happens when EAGLE speculative decoding is enabled without the optimized MoE configs.

Result: immediate crash.

OutOfResources: out of resource: shared memory,
Required: 147456, Hardware limit: 101376

EAGLE’s speculative batching triggers Triton kernel configurations that exceed GB10’s 101KB shared memory limit. Without tuned configs, the server doesn’t even start.

Full A/B/C Comparison (GLM-4.7-FP8, 4-node TP=4)

| Scenario | MoE Configs | EAGLE | Result |
| --- | --- | --- | --- |
| A | ✅ Optimized | ❌ Off | 16.77 tok/s |
| B | ❌ Default | ❌ Off | 15.77 tok/s (-6.0% vs. A) |
| C | ❌ Default | ✅ On | 💥 OutOfResources crash |
| Production | ✅ Optimized | ✅ On | 20-27 tok/s |

What This Means

The tuned MoE configs aren’t just a performance optimization — they enable EAGLE speculative decoding entirely on GB10. Without them:

  • EAGLE off: you get ~15.8 tok/s (fallback kernels work but are slower)
  • EAGLE on: server crashes at startup

With them:

  • EAGLE off: 16.77 tok/s (+6.3% from better kernel tiling)
  • EAGLE on: 20-27 tok/s (full speculative decoding)

That’s a throughput gap of roughly 27% to 71% between default configs (15.77 tok/s) and optimized + EAGLE (20-27 tok/s).
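
The percentages in this update can be rechecked from the table figures with a few lines. This is only arithmetic on the numbers already reported above; the helper name is mine.

```python
def pct_change(new, base):
    """Relative change of `new` over `base`, in percent."""
    return (new - base) / base * 100


default_tok_s = 15.77      # scenario B: default configs, EAGLE off
optimized_tok_s = 16.77    # scenario A: optimized configs, EAGLE off
eagle_low, eagle_high = 20.0, 27.0  # production range with EAGLE on

print(f"optimized vs default:   {pct_change(optimized_tok_s, default_tok_s):+.1f}%")
print(f"EAGLE (low) vs default:  {pct_change(eagle_low, default_tok_s):+.1f}%")
print(f"EAGLE (high) vs default: {pct_change(eagle_high, default_tok_s):+.1f}%")
```

Note that the gain depends on which run is the baseline: the same 1 tok/s difference is +6.3% measured from the default run and about -6.0% measured from the optimized run.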

The configs and Docker image are in the GitHub repo. If you’re running any MoE model on DGX Spark with SGLang, I’d strongly recommend tuning your MoE kernels before enabling speculative decoding.

awesome work, many thanks for posting it, will check it out.

Impressive work as always. At this point, if NVIDIA launched a “DGX Spark Software Fitness Bounty Program,” you could practically retire. 🏖️

Thanks! Though at this rate, NVIDIA owes me more in workaround hours than the Spark itself costs. Maybe the bounty should be retroactive. 😄

Looks like these tuned triton configs should be compatible with vLLM as well, have you tried it?

I haven’t tried it yet but I’m very curious.

Since it is using the same Triton kernels package, should be possible. I’ll test it when I have time. The only thing is that standard Triton for vLLM is 3.5.1 now, but 3.5.0 config should work, I guess?

Also, there is an approved PR that would officially upgrade vLLM to torch 2.10 and triton 3.6.0, so I’m not sure how that would change things. The PR is currently failing CI checks, but I expect it to merge any time now.

The PR has been merged.

That’s good news! OK, I will try it as soon as possible.