Hi everyone,
I’ve recently been working on CUTLASS-level kernel optimization for SM121 (GB10) and wanted to share results that might be useful for the DGX Spark community.
Background
SM121 lacks tcgen05, multicast, and 2-SM MMA — so SM100 datacenter kernels don’t run natively. The default fallback is SM80-class kernels, which leave significant performance on the table. I set out to build and benchmark proper SM120/SM121-targeted CUTLASS kernels.
Environment
- 4x DGX Spark (GB10, 128GB each)
- CUDA 13.1 + CUTLASS 4.4.0
- Nsight Compute for profiling
- 200Gbps RoCE dedicated fabric (Mikrotik CRS812 DDQ)
GEMM Benchmark Results
NVFP4 (Block-Scaled FP4)
| Problem Size | TFLOPS | Notes |
|---|---|---|
| 4096 x 14336 x 4096 | 356 | Peak measured |
| Various MoE shapes | 120-154 | Expert-dependent |
356 TFLOPS on dense NVFP4 = ~71% utilization of the 500 TFLOPS dense FP4 peak (1 PFLOPS spec includes sparsity).
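For anyone checking these numbers against their own runs, here is a minimal sketch of the arithmetic. It assumes the usual 2·M·N·K FLOP count for a GEMM and the 500 TFLOPS dense peak quoted above; the function names are made up for illustration, and the implied kernel time is derived from the reported throughput, not a separate measurement.

```python
# Sketch of the throughput/utilization arithmetic (helper names are hypothetical).

def gemm_flops(m, n, k):
    """FLOPs for one GEMM, counting each multiply-add as 2 operations."""
    return 2 * m * n * k


def achieved_tflops(m, n, k, seconds):
    """Throughput in TFLOPS for a GEMM that took `seconds` to run."""
    return gemm_flops(m, n, k) / seconds / 1e12


def utilization(measured_tflops, peak_tflops):
    """Fraction of the peak that was actually reached."""
    return measured_tflops / peak_tflops


# Shape and throughput from the NVFP4 table; peak is the dense FP4 figure above.
flops = gemm_flops(4096, 14336, 4096)
implied_seconds = flops / 356e12  # kernel time implied by 356 TFLOPS

print(f"FLOPs per GEMM: {flops:,}")
print(f"Implied kernel time: {implied_seconds * 1e3:.3f} ms")
print(f"Utilization vs. 500 TFLOPS dense: {utilization(356, 500):.1%}")
```

When timing your own kernels, plug the measured time into `achieved_tflops` instead of working backwards from the throughput.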
FP8
| Problem Size | TFLOPS |
|---|---|
| Peak (large GEMM) | 188 |
MoE Grouped GEMM (8 & 64 experts)
| Tile Shape | Throughput (N=256) | Best For |
|---|---|---|
| 256x128 | 154 TFLOPS | Prefill / large batch |
| 128x128 | ~147 TFLOPS | Decode / small batch |
Key challenge: GB10 has 101,376 bytes (99 KiB) of shared memory available per thread block, the same as RTX 4090. SGLang’s default MoE configs request ~147KB and fail with OutOfResources. Custom tile configs that fit within this limit solve this.
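To get a feel for why some tile configs blow the budget, here is a rough sketch of a shared-memory estimate. It assumes a simplified model where each pipeline stage holds one A tile and one B tile at 1 byte per element (FP8); the real allocation made by Triton can differ, and the candidate configs below are hypothetical examples, not the actual SGLang defaults or the tuned configs from the repo.

```python
# Rough shared-memory estimate for a pipelined GEMM tile (simplified model).

SMEM_LIMIT_BYTES = 101_376  # limit quoted above for GB10


def estimate_smem_bytes(block_m, block_n, block_k, num_stages, bytes_per_elem=1):
    """One A tile (block_m x block_k) plus one B tile (block_k x block_n) per stage."""
    a_tile = block_m * block_k * bytes_per_elem
    b_tile = block_k * block_n * bytes_per_elem
    return num_stages * (a_tile + b_tile)


def fits(cfg, limit=SMEM_LIMIT_BYTES):
    """True if the estimated footprint stays within the shared-memory limit."""
    return estimate_smem_bytes(**cfg) <= limit


# Hypothetical candidates for illustration only.
candidates = [
    {"block_m": 128, "block_n": 256, "block_k": 128, "num_stages": 3},
    {"block_m": 128, "block_n": 128, "block_k": 128, "num_stages": 3},
    {"block_m": 64, "block_n": 128, "block_k": 128, "num_stages": 4},
]

for cfg in candidates:
    size = estimate_smem_bytes(**cfg)
    status = "fits" if fits(cfg) else "too large"
    print(f"{cfg} -> {size} bytes ({status})")
```

Under this model, shrinking the tile or dropping a pipeline stage are the two knobs that bring a config back under the limit, which is what a tile sweep explores.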
SGLang A/B Test (GLM-4.7-FP8, 4-node TP=4)
| Metric | With Optimized Configs | Without (Default) | Delta |
|---|---|---|---|
| Throughput | 16.77 tok/s | 15.77 tok/s | +6.3% |
| Stability | ±0.01 | ±0.01 | Both stable |
Note: EAGLE speculative decoding was disabled for this comparison. With EAGLE enabled, throughput is 20-27 tok/s.
What Was Done
- Python DSL patch: Added `sm_121a` to `BlockScaledMmaOp.admissible_archs` (CUTLASS issue #2800)
- NVFP4 + FP8 GEMM kernels: Built via C++ collective builder targeting SM121, TN layout, cluster 1x1x1
- MoE tile sweep: Benchmarked tile shapes within 101KB SMEM budget, found optimal configs per workload
- SGLang MoE configs: Generated 4 config files for `NVIDIA_GB10` (2 per Triton version: 3.3.0 and 3.5.0); these didn’t exist before
- Docker image: Pre-built container with all configs applied
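For readers who have not seen a fused-MoE config file before, here is a minimal sketch of what generating one might look like. It assumes the JSON layout and file-naming pattern that SGLang's Triton fused-MoE code uses as far as I know (batch size as a string key mapping to tile parameters, and `E=...,N=...,device_name=...,dtype=...` in the name); check the current SGLang source before relying on it. All helper names and tile values are illustrative, not the tuned values from the repo.

```python
import json
import tempfile
from pathlib import Path


def config_file_name(num_experts, n, device_name, dtype=None):
    """Assumed naming pattern: E=<experts>,N=<n>,device_name=<gpu>[,dtype=<dtype>].json"""
    dtype_part = f",dtype={dtype}" if dtype else ""
    return f"E={num_experts},N={n},device_name={device_name}{dtype_part}.json"


def make_entry(block_m, block_n, block_k, group_m, num_warps, num_stages):
    """Tile parameters for one batch-size bucket."""
    return {
        "BLOCK_SIZE_M": block_m,
        "BLOCK_SIZE_N": block_n,
        "BLOCK_SIZE_K": block_k,
        "GROUP_SIZE_M": group_m,
        "num_warps": num_warps,
        "num_stages": num_stages,
    }


def build_config(batch_sizes, decode_entry, prefill_entry, prefill_threshold):
    """Use the small-tile entry below the threshold and the large-tile entry at or above it."""
    return {
        str(m): (prefill_entry if m >= prefill_threshold else decode_entry)
        for m in batch_sizes
    }


# Illustrative values only; both are chosen to stay small in shared memory.
decode_entry = make_entry(16, 128, 128, 1, 4, 3)
prefill_entry = make_entry(128, 128, 128, 8, 8, 3)
config = build_config([1, 8, 64, 512, 4096], decode_entry, prefill_entry, 64)

name = config_file_name(8, 256, "NVIDIA_GB10", "fp8_w8a8")
out_path = Path(tempfile.mkdtemp()) / name
out_path.write_text(json.dumps(config, indent=4))
print(f"Wrote {out_path}")
```

The split between a small-tile entry for decode and a large-tile entry for prefill mirrors the tile-shape table above; a real tuning run would pick each bucket from benchmark results.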
Deliverables (Open Source)
- GitHub: BTankut/dgx-spark-sglang-moe-configs. Optimized SGLang MoE kernel configs for NVIDIA DGX Spark (GB10): GLM-4.7-FP8 and other MoE models
- Docker: `ghcr.io/btankut/sglang-spark-glm47:latest` (includes MoE configs + patches)
- Includes: config files, Dockerfile, tuning guide, benchmark data
What Was Skipped (and Why)
- Attention kernel port (Phase 4): FlashInfer 0.5.3 already works on SM121. Porting CUTLASS attention from example 77 is very complex, with low practical ROI.
- Multi-node NCCL optimization (Phase 6): Current RoCE/RDMA setup provides adequate performance for 4-node TP=4.
Key Takeaways for GB10 Users
- SM121 does have real NVFP4/FP8 Tensor Core capability — 356 TFLOPS is proof
- The 101KB shared memory limit is the main constraint, not compute
- Default MoE configs from SGLang/vLLM will fail on GB10 — custom configs are required
- CUTLASS 4.4.0 with `sm_121a` target works; you don’t need to wait for upstream fixes
If anyone wants to reproduce or extend this work, everything is in the GitHub repo. Happy to answer questions.