SM121 CUTLASS Kernel Optimization Results: NVFP4 356 TFLOPS, MoE Grouped GEMM on DGX Spark

baristankut · February 7, 2026, 11:57am

Hi everyone,

I’ve recently been working on CUTLASS-level kernel optimization for SM121 (GB10) and wanted to share results that might be useful for the DGX Spark community.

Background

SM121 lacks tcgen05, multicast, and 2-SM MMA — so SM100 datacenter kernels don’t run natively. The default fallback is SM80-class kernels, which leave significant performance on the table. I set out to build and benchmark proper SM120/SM121-targeted CUTLASS kernels.

Environment

4x DGX Spark (GB10, 128GB each)
CUDA 13.1 + CUTLASS 4.4.0
Nsight Compute for profiling
200Gbps RoCE dedicated fabric (Mikrotik CRS812 DDQ)

GEMM Benchmark Results

NVFP4 (Block-Scaled FP4)

Problem Size	TFLOPS	Notes
4096 x 14336 x 4096	356	Peak measured
Various MoE shapes	120-154	Expert-dependent

356 TFLOPS on dense NVFP4 = ~71% utilization of the 500 TFLOPS dense FP4 peak (1 PFLOPS spec includes sparsity).
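The utilization figure follows from the usual GEMM FLOP count, 2·M·N·K (one multiply and one add per inner-product term). The sketch below only sanity-checks that arithmetic. The function names are illustrative, and the kernel time is back-computed from the reported 356 TFLOPS rather than measured.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs for one M x N x K GEMM, counting each multiply-add as 2 FLOPs."""
    return 2 * m * n * k


def achieved_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in TFLOPS for a GEMM that took `seconds` to run."""
    return gemm_flops(m, n, k) / seconds / 1e12


def utilization(measured_tflops: float, peak_tflops: float) -> float:
    """Fraction of the peak that was actually reached."""
    return measured_tflops / peak_tflops


# Problem size from the table above (the product is order-independent).
flops = gemm_flops(4096, 14336, 4096)

# Kernel time implied by 356 TFLOPS (back-computed, not a measurement).
implied_seconds = flops / (356 * 1e12)

# 356 TFLOPS against the 500 TFLOPS dense FP4 peak quoted above.
dense_utilization = utilization(356, 500)

print(f"{flops / 1e12:.3f} TFLOP per GEMM")
print(f"{implied_seconds * 1e3:.2f} ms per GEMM at 356 TFLOPS")
print(f"{dense_utilization:.1%} of dense peak")
```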

FP8

Problem Size	TFLOPS
Peak (large GEMM)	188

MoE Grouped GEMM (8 & 64 experts)

Tile Shape	TFLOPS (N=256)	Best For
256x128	154 TFLOPS	Prefill / large batch
128x128	~147 TFLOPS	Decode / small batch

Key challenge: GB10 has 101,376 bytes shared memory (same as RTX 4090). SGLang’s default MoE configs request ~147KB and fail with OutOfResources. Custom tile configs solve this.
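A quick way to screen tile configs against this budget is to estimate the staged A and B tile footprint. The formula below is a deliberate simplification (one A tile plus one B tile per pipeline stage, 1-byte FP8 elements); the real allocation made by the compiler can differ. The candidate tile values are hypothetical and were picked only because the first one happens to land on 147,456 bytes under this estimate. They are not taken from SGLang's shipped configs.

```python
GB10_SMEM_LIMIT = 101_376  # bytes, as reported above


def estimate_smem_bytes(block_m: int, block_n: int, block_k: int,
                        num_stages: int, elem_bytes: int = 1) -> int:
    """Rough estimate: one A tile and one B tile buffered per pipeline stage."""
    a_tile = block_m * block_k * elem_bytes
    b_tile = block_k * block_n * elem_bytes
    return (a_tile + b_tile) * num_stages


def fits(cfg: dict, limit: int = GB10_SMEM_LIMIT) -> bool:
    """True if the estimated footprint stays within the shared memory limit."""
    return estimate_smem_bytes(**cfg) <= limit


# Hypothetical candidates, for illustration only.
candidates = [
    {"block_m": 128, "block_n": 256, "block_k": 128, "num_stages": 3},
    {"block_m": 128, "block_n": 256, "block_k": 128, "num_stages": 2},
    {"block_m": 64, "block_n": 128, "block_k": 128, "num_stages": 4},
]

for cfg in candidates:
    size = estimate_smem_bytes(**cfg)
    verdict = "ok" if fits(cfg) else "exceeds limit"
    print(f"{cfg} -> {size} bytes ({verdict})")
```

Under this estimate, dropping one pipeline stage or shrinking the tile is enough to get back under the limit, which is the kind of trade-off a tile sweep explores.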

SGLang A/B Test (GLM-4.7-FP8, 4-node TP=4)

Metric	With Optimized Configs	Without (Default)	Delta
Throughput	16.77 tok/s	15.77 tok/s	+6.3%
Stability	±0.01	±0.01	Both stable

Note: EAGLE speculative decoding was disabled for this comparison. With EAGLE enabled, throughput is 20-27 tok/s.

What Was Done

Python DSL patch: Added sm_121a to BlockScaledMmaOp.admissible_archs (CUTLASS issue #2800)
NVFP4 + FP8 GEMM kernels: Built via C++ collective builder targeting SM121, TN layout, cluster 1x1x1
MoE tile sweep: Benchmarked tile shapes within 101KB SMEM budget, found optimal configs per workload
SGLang MoE configs: Generated 4 config files for NVIDIA_GB10 (2 per Triton version: 3.3.0 and 3.5.0) — these didn’t exist before
Docker image: Pre-built container with all configs applied

Deliverables (Open Source)

GitHub: BTankut/dgx-spark-sglang-moe-configs (Optimized SGLang MoE kernel configs for NVIDIA DGX Spark (GB10), GLM-4.7-FP8 and other MoE models)
Docker: ghcr.io/btankut/sglang-spark-glm47:latest (includes MoE configs + patches)
Includes: config files, Dockerfile, tuning guide, benchmark data

What Was Skipped (and Why)

Attention kernel port (Phase 4): FlashInfer 0.5.3 already works on SM121. Porting CUTLASS attention from example 77 is very complex and offers low practical ROI.
Multi-node NCCL optimization (Phase 6): Current RoCE/RDMA setup provides adequate performance for 4-node TP=4.

Key Takeaways for GB10 Users

SM121 does have real NVFP4/FP8 Tensor Core capability — 356 TFLOPS is proof
The 101KB shared memory limit is the main constraint, not compute
Default MoE configs from SGLang/vLLM will fail on GB10 — custom configs are required
CUTLASS 4.4.0 with sm_121a target works — you don’t need to wait for upstream fixes

If anyone wants to reproduce or extend this work, everything is in the GitHub repo. Happy to answer questions.


baristankut · February 7, 2026, 12:04pm

Update: EAGLE Speculative Decoding Requires Tuned MoE Configs on GB10

After publishing the initial results, I ran an additional test to see what happens when EAGLE speculative decoding is enabled without the optimized MoE configs.

Result: immediate crash.

OutOfResources: out of resource: shared memory,
Required: 147456, Hardware limit: 101376

EAGLE’s speculative batching triggers Triton kernel configurations that exceed GB10’s 101KB shared memory limit. Without tuned configs, the server doesn’t even start.

Full A/B/C Comparison (GLM-4.7-FP8, 4-node TP=4)

Scenario	MoE Configs	EAGLE	Result
A	✅ Optimized	❌ Off	16.77 tok/s
B	❌ Default	❌ Off	15.77 tok/s (-6.0% vs. A)
C	❌ Default	✅ On	💥 `OutOfResources` crash
Production	✅ Optimized	✅ On	20-27 tok/s

What This Means

The tuned MoE configs aren’t just a performance optimization — they enable EAGLE speculative decoding entirely on GB10. Without them:

EAGLE off: you get ~15.8 tok/s (fallback kernels work but are slower)
EAGLE on: server crashes at startup

With them:

EAGLE off: 16.77 tok/s (+6.3% from better kernel tiling)
EAGLE on: 20-27 tok/s (full speculative decoding)

That’s a throughput gap of roughly 27% to 71% between default configs (15.77 tok/s) and optimized configs + EAGLE (20-27 tok/s), i.e. ~70% at the top of the range.
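The gap can be checked directly from the figures in the comparison table. This is plain arithmetic on the reported numbers; the variable names are illustrative.

```python
def relative_gain(new: float, base: float) -> float:
    """Relative change of `new` over `base`, as a fraction."""
    return (new - base) / base


default_no_eagle = 15.77             # scenario B
optimized_no_eagle = 16.77           # scenario A
optimized_eagle_range = (20.0, 27.0)  # production range

config_only_gain = relative_gain(optimized_no_eagle, default_no_eagle)
eagle_gain_low = relative_gain(optimized_eagle_range[0], default_no_eagle)
eagle_gain_high = relative_gain(optimized_eagle_range[1], default_no_eagle)

print(f"configs only:    {config_only_gain:+.1%}")
print(f"configs + EAGLE: {eagle_gain_low:+.1%} to {eagle_gain_high:+.1%}")
```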

The configs and Docker image are in the GitHub repo. If you’re running any MoE model on DGX Spark with SGLang, I’d strongly recommend tuning your MoE kernels before enabling speculative decoding.

arctic.gus · February 7, 2026, 12:07pm

awesome work, many thanks for posting it, will check it out.

flash3 · February 7, 2026, 3:24pm

Impressive work as always. At this point, if NVIDIA launched a “DGX Spark Software Fitness Bounty Program,” you could practically retire. 🏖️

baristankut · February 7, 2026, 5:48pm

Thanks! Though at this rate, NVIDIA owes me more in workaround hours than the Spark itself costs. Maybe the bounty should be retroactive. 😄

eugr · February 7, 2026, 7:12pm

Looks like these tuned Triton configs should be compatible with vLLM as well. Have you tried it?

baristankut · February 7, 2026, 8:06pm

I haven’t tried it yet but I’m very curious.

eugr · February 7, 2026, 8:14pm

Since it is using the same Triton kernels package, should be possible. I’ll test it when I have time. The only thing is that standard Triton for vLLM is 3.5.1 now, but 3.5.0 config should work, I guess?

Also, there is an approved PR that would officially upgrade vLLM to torch 2.10 and Triton 3.6.0, so I'm not sure how that would change things. The PR is currently failing CI checks, but I expect it to merge any time now.

eugr · February 8, 2026, 11:28pm

The PR has been merged.

baristankut · February 9, 2026, 11:53am

That’s good news! OK, I will try it as soon as possible.
