Dearest CUTLASS TEAM, When the hell are you going to properly fix tcgen05 FP4 support for DGX Spark / GB10 (SM121)?

jpark.97 · February 4, 2026, 3:33am

Right now the CuTe Python DSL is still behaving like SM121 doesn’t exist:

Issue #2947 documents tcgen05/FP4 being hard-restricted to sm_100a/sm_103a, rejecting sm_121 / sm_121a.
Issue #2800 still shows BlockScaledMmaOp restricting FP4 ops to sm_100a only, blocking sm_120/sm_121.
Issue #2802 shows the same pattern: expects arch … [‘sm_100a’,‘sm_100f’], but got sm_121a in tcgen05 MMA code.

This is getting ridiculous… seriously

Please don’t respond with “use a workaround that maps SM121 to something else.” People bought Spark specifically for Blackwell features, and the community is already calling out “fixes” that disable them.

eugr · February 4, 2026, 6:11am

Well, some work is being done - see this one, for example: [Draft][Cute,Fwd,Sm120] FA Cute DSL sm12x by johnnynunez · Pull Request #2222 · Dao-AILab/flash-attention · GitHub

The core issue is that there is datacenter Blackwell sm10x (with tcgen05) and consumer Blackwell sm12x (without tcgen05). Both are marketed as Blackwell, but they are not the same.

christopher_owen · February 4, 2026, 7:58am

I fixed it in my branch, but I haven’t upstreamed it yet.

jpark.97 · February 4, 2026, 1:08pm

@eugr @christopher_owen thanks, guys. Both of you are keeping this GB10 boat afloat. Much appreciated.

johnny_nv · March 24, 2026, 10:18am

github.com/NVIDIA/cutlass

python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/mma.py

982748aa7


      
          a_dtype: Type[Numeric]
          b_dtype: Type[Numeric]
          acc_dtype: Type[Numeric]
          shape_mnk: Shape
          cta_group: CtaGroup
          a_src: OperandSource
          a_major_mode: OperandMajorMode
          b_major_mode: OperandMajorMode
          
          admissible_archs = Arch.filter(
              lambda arch: arch.is_family_of(Arch.sm_100f) or arch.is_family_of(Arch.sm_110f)
          )
          
          def __post_init__(self) -> None:
              # Verify arch
              arch = BaseDSL._get_dsl().get_arch_enum()
              if arch not in self.admissible_archs:
                  raise OpError(
                      self,
                      f"expects arch to be one of {self.admissible_archs}, but got {arch}",
                      suggestion="Ensure env CUTE_DSL_ARCH matches your GPU architecture",

sm12x doesn’t have tcgen05. See: 1. Introduction — PTX ISA 9.2 documentation
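To make the gating pattern in the excerpt above easier to follow, here is a minimal standalone sketch in plain Python. It does not import CUTLASS: `TCGEN05_MAJORS`, `ArchError`, `current_arch`, `is_admissible`, and `FakeTcgen05MmaOp` are hypothetical stand-ins, and the family test is a simplified numeric check rather than the real `Arch.is_family_of` logic. The only name taken from the excerpt is the `CUTE_DSL_ARCH` environment variable.

```python
import os
import re
from dataclasses import dataclass

# Assumption: arch "majors" treated as tcgen05-capable, loosely mirroring the
# excerpt's sm_100f / sm_110f filter. Illustrative only, not read from CUTLASS.
TCGEN05_MAJORS = (10, 11)


class ArchError(ValueError):
    """Hypothetical stand-in for the DSL's OpError."""


def current_arch(default: str = "sm_121a") -> str:
    # Reads the env var named in the excerpt's suggestion text.
    return os.environ.get("CUTE_DSL_ARCH", default)


def is_admissible(arch: str) -> bool:
    # Simplified family test: parse "sm_<number>[a|f]" and compare the
    # number with its last digit dropped (100 -> 10, 121 -> 12).
    match = re.fullmatch(r"sm_(\d+)[af]?", arch)
    if match is None:
        return False
    return int(match.group(1)) // 10 in TCGEN05_MAJORS


@dataclass(frozen=True)
class FakeTcgen05MmaOp:
    shape_mnk: tuple

    def __post_init__(self) -> None:
        # Same idea as the excerpt: validate the target arch at construction
        # time, so an unsupported arch fails early with a clear message.
        arch = current_arch()
        if not is_admissible(arch):
            raise ArchError(
                f"expects a tcgen05-capable arch (majors {TCGEN05_MAJORS}), "
                f"but got {arch}"
            )
```

With `CUTE_DSL_ARCH=sm_121a`, constructing `FakeTcgen05MmaOp((128, 128, 64))` raises `ArchError`, which is the shape of the error reported in the issues linked at the top of the thread. With `CUTE_DSL_ARCH=sm_100a` it constructs normally.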

_cjg · March 24, 2026, 11:35am

Interesting, but a little disappointing overview (for 120) …

github.com/voipmonitor/rtx6kpro

inference-engines/flashinfer.md

master

# FlashInfer on RTX 6000 Pro Blackwell (SM120)

## Table of Contents

- [What is FlashInfer](#what-is-flashinfer)
- [SM120 Backend Landscape](#sm120-backend-landscape)
- [CUTLASS Backend and SM120 Support](#cutlass-backend-and-sm120-support)
- [SM120f Family Conditional Instructions](#sm120f-family-conditional-instructions)
- [flashinfer_cudnn Workaround](#flashinfer_cudnn-workaround)
- [CUTLASS Race Condition Bug](#cutlass-race-condition-bug)
- [FA2 vs CUTLASS Performance Comparison](#fa2-vs-cutlass-performance-comparison)
- [JIT Cache Management](#jit-cache-management)
- [MLA Kernels for SM120](#mla-kernels-for-sm120)
- [Relevant PRs and Issues](#relevant-prs-and-issues)

---

## What is FlashInfer

FlashInfer is a library of GPU kernels for LLM inference, providing attention backends, MoE GEMM runners, and allreduce fusion primitives. On RTX 6000 Pro Blackwell (SM120), FlashInfer is the primary kernel library used by SGLang and partially by vLLM.


vgoklani · March 24, 2026, 1:01pm

@johnny_nv What does “5th Generation Tensor Cores” mean? Is that not the same thing as tcgen05???

See the official spec sheet, third page, where it says “5th generation tensor cores”.

trystan1 · March 24, 2026, 1:16pm

@johnny_nv

Can we expect a version of FP4 acceleration using the watered-down tcgen05 (smaller 99 KB shared memory) with CUDA 13.2 / driver 595, along with an updated version of CUTLASS?

I think the community is running out of patience with the software support and deluge of vibe coded attempts/pull requests claiming to have fixed the underlying issue.

A clear roadmap/milestone confirmation would really do wonders.

twosg · March 24, 2026, 3:54pm

I doubt we will get official support for cuda 13.2 and/or driver 595 on the Spark anytime soon, because driver 590 is still “in beta” for the Spark: Upgrading the GPU Driver from 580 to 590 on DGX Spark using CLI - #2 by aniculescu

martinB78 · March 24, 2026, 4:31pm

I click on “Update Available [Update Now]” almost every single day, and I’m happy that updates are so frequent, but the stuff that needs fixing doesn’t seem to get fixed.
I’m still struggling with CUDA 13 / PyTorch support (TorchCodec / torchaudio issues on ARM64).

johnny_nv · March 24, 2026, 5:03pm

drivers/ubuntu and ubuntu 26.04 are on our roadmap

johnny_nv · March 24, 2026, 5:04pm

All frameworks are compatible with DGX Spark; we are working on the performance side.

johnny_nv · March 24, 2026, 5:04pm

Yes. As you can see in FlashInfer, CUTLASS, and FlashAttention, we are adding support and exposing kernels that use NVFP4.

jl121 · March 24, 2026, 8:34pm

I am sure the whole community is pleased to read this, but (no offense personally!) the vagueness of the information on a timeline or roadmap for NVFP4 performance on the Blackwell GeForce series is wearing down the last of many people’s patience.

I think the extremely vague specification that the GB10 systems were sold and hyped under, for hardware that now seems to be quite different from the datacenter Blackwell series, combined with the lack of a properly functioning software stack to match the sales hype, is (to be honest) slowly beginning to piss people off. Especially the fact that the 5th-gen Tensor Cores are “the cheap 5th-gen cores”.

No doubt the hardware is impressive and actually groundbreaking for its form factor and power requirements, but it would be fitting to put as much effort into maturing the software stack as went into the sales hype.

Especially considering the massive delays, the chaos in distribution, and the lack of priority for early backers.

Nvidia is lucky that this did not create a regular shitstorm.

I love the hardware and expected an edge software stack, but not such a timeline, and most of all not the lack of clearly communicated information on the timeline for native software support.

I just really don’t understand the lack of official communication on the matter…

johnny_nv · March 24, 2026, 9:56pm

Hello. Developers across the frameworks now have access to DGX Spark and AGX Thor, so we are working with all of them to improve support.

johnny_nv · March 24, 2026, 9:59pm

tcgen05 for dummies - gau-nernst's blog, by one of the best devs in the community

trystan1 · March 24, 2026, 11:32pm

This is great but why link a blogpost that does a deep dive on instructions we can’t access on the spark?

johnny_nv · March 25, 2026, 12:45am

I get the frustration here, but there’s a fundamental misunderstanding driving most of these issues.

SM12x (GB10 / DGX Spark / RTX 50) does not implement tcgen05, and therefore it also doesn’t support the associated FP4 Tensor Core path exposed through that ISA.

The current restrictions you’re seeing in CuTe (e.g. limiting tcgen05 ops to sm_100a / sm_110f families) are intentional. Those ops map to hardware features that only exist on datacenter Blackwell (SM100/SM110), which includes:

tcgen05 MMA instructions
TMEM-backed accumulation model
blockscaled FP4 paths tied to that pipeline

SM12x is a different architecture target, and uses a different Tensor Core programming model. So mapping SM121 → SM100 just to “enable” tcgen05 would be incorrect and likely produce invalid codegen.
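One way to picture the point about not remapping SM121 to SM100 is a dispatcher that picks a kernel path from the device's compute capability and gives SM12x its own branch. This is only a sketch: the path labels `tcgen05_tmem` and `sm12x_mma` and the function `select_fp4_path` are hypothetical, and the capability-to-path mapping reflects only what is stated in this thread, not an official support matrix.

```python
from typing import Tuple

# Hypothetical labels for the two programming models discussed in this thread.
PATH_TCGEN05 = "tcgen05_tmem"  # datacenter Blackwell (SM100/SM110 per this thread)
PATH_SM12X = "sm12x_mma"       # SM12x (GB10 / DGX Spark / RTX 50)


def select_fp4_path(capability: Tuple[int, int]) -> str:
    """Pick an FP4 GEMM path from a (major, minor) compute capability."""
    major, _minor = capability
    if major in (10, 11):
        return PATH_TCGEN05
    if major == 12:
        # Deliberately no fallback to the SM10x path: SM12x gets its own
        # branch instead of being relabeled as SM100.
        return PATH_SM12X
    raise NotImplementedError(
        f"no FP4 path sketched for capability {capability}"
    )


if __name__ == "__main__":
    # (12, 1) corresponds to SM121, i.e. GB10 / DGX Spark.
    print(select_fp4_path((12, 1)))
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that could be passed in directly. Failing loudly on unknown capabilities, rather than defaulting to one of the two paths, keeps a new architecture from silently inheriting codegen it may not support.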

For DGX Spark / GB10 or sm120, FP4 support needs to go through the supported MMA pipelines for that architecture, not tcgen05. Example: cutlass/examples/python/CuTeDSL/blackwell_geforce/dense_gemm.py at main · NVIDIA/cutlass · GitHub

Happy to take a look at specific use cases if you’re trying to get FP4 working on SM121. From our side, we are working to fix the issues that @eugr and other community members are hitting across multiple frameworks.

trystan1 · March 25, 2026, 3:11am

OK, so why isn’t it currently available across at least TensorRT-LLM and vLLM? TensorRT-LLM is literally your runtime; I’d expect the full capabilities of the hardware to be realized.

“Happy to take a look at specific use cases if you’re trying to get FP4 working on SM121”

The use case is inference: I’d like inference to work on my NVIDIA inference box that’s running an NVIDIA quantization format with an NVIDIA model.

If the “supported MMA pipelines” on sm120 and sm121 aren’t available via any software, then we’re not really using the hardware, or am I wrong?

serapis · March 25, 2026, 3:12am

I believe most of us are looking for concrete timelines for substantial improvements. The Spark was promised and sold as more capable than it is proving to be right now. The community is trying its very best to help close that gap wherever it can, and is seeing some performance improvements that would benefit greatly from additional support from Nvidia.

If tcgen is officially off the table now: what can we expect from Nvidia to help make the Spark more capable?
