Dearest CUTLASS TEAM, When the hell are you going to properly fix tcgen05 FP4 support for DGX Spark / GB10 (SM121)?

Right now the CuTe Python DSL is still behaving like SM121 doesn’t exist:

  • Issue #2947 documents tcgen05/FP4 being hard-restricted to sm_100a/sm_103a, rejecting sm_121 / sm_121a.

  • Issue #2800 still shows BlockScaledMmaOp restricting FP4 ops to sm_100a only, blocking sm_120/sm_121.

  • Issue #2802 shows the same pattern: expects arch … [‘sm_100a’,‘sm_100f’], but got sm_121a in tcgen05 MMA code.
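For readers who have not hit this error, the pattern the three issues describe can be sketched as a plain allow-list check. This is only an illustration, not CUTLASS source: the names `ALLOWED_TCGEN05_ARCHS`, `UnsupportedArchError`, and `check_tcgen05_arch` are hypothetical, and the allow-list simply copies the arch strings quoted from issue #2947.

```python
# Hypothetical sketch of an arch allow-list gate; NOT the actual CuTe DSL code.
# The allowed arch strings are the ones quoted from issue #2947 above.
ALLOWED_TCGEN05_ARCHS = ("sm_100a", "sm_103a")


class UnsupportedArchError(ValueError):
    """Raised when an op is requested for an arch outside its allow-list."""


def check_tcgen05_arch(arch: str) -> None:
    """Reject any arch string that is not explicitly allow-listed."""
    if arch not in ALLOWED_TCGEN05_ARCHS:
        raise UnsupportedArchError(
            f"expects arch in {list(ALLOWED_TCGEN05_ARCHS)}, but got {arch}"
        )


if __name__ == "__main__":
    check_tcgen05_arch("sm_100a")  # passes silently
    try:
        check_tcgen05_arch("sm_121a")  # DGX Spark / GB10 target
    except UnsupportedArchError as err:
        print(err)
```

With a gate of this shape, `sm_121` and `sm_121a` are rejected before any code generation is attempted, which matches the behaviour reported in the issues.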

This is getting ridiculous… seriously

Please don’t respond with “use a workaround that maps SM121 to something else.” People bought Spark specifically for Blackwell features, and the community is already calling out “fixes” that disable them.

Well, some work is being done - see this one, for example: [Draft][Cute,Fwd,Sm120] FA Cute DSL sm12x by johnnynunez · Pull Request #2222 · Dao-AILab/flash-attention · GitHub

The core issue is that there is datacenter Blackwell sm10x (with tcgen05) and consumer Blackwell sm12x (without tcgen05). Both are marketed as Blackwell, but they are not the same.

I fixed it in my branch, but I haven’t upstreamed it yet.

@eugr @christopher_owen thanks guys both of U are keeping this gb10 boat afloat. Much appreciation

sm12x doesn’t have tcgen05. See: 1. Introduction — PTX ISA 9.2 documentation

Interesting, but a little disappointing overview (for 120) …

@johnny_nv What does “5th Generation Tensor Cores” mean? Is that not the same thing as tcgen05???

See the official spec sheet, third page, where it says “5th generation tensor cores”.

@johnny_nv

Can we expect a version of FP4 acceleration using the watered-down tcgen05 (smaller 99 KB shared memory) with CUDA 13.2 / driver 595, along with an updated version of CUTLASS?
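To put the 99 KB figure in perspective, here is a rough back-of-the-envelope sketch of how many pipeline stages of NVFP4 operand tiles would fit in that budget. It takes the 99 KB per-block number above at face value, assumes the NVFP4 layout of 4-bit values with one 1-byte scale factor per 16 elements, and uses a hypothetical 128x128x128 tile; it ignores epilogue buffers, barriers, and alignment padding, so real kernels will fit fewer stages.

```python
# Rough shared-memory budget for NVFP4 operand tiles (illustrative only).
# Assumptions: 99 KB per-block limit (from the post above), 4-bit elements,
# one 1-byte scale factor per 16 elements along K, no padding or epilogue.
SMEM_LIMIT_BYTES = 99 * 1024


def nvfp4_stage_bytes(tile_m: int, tile_n: int, tile_k: int, sf_block: int = 16) -> int:
    """Bytes for one pipeline stage: A tile, B tile, and their scale factors."""
    a_bytes = tile_m * tile_k // 2            # two FP4 values per byte
    b_bytes = tile_n * tile_k // 2
    sfa_bytes = tile_m * (tile_k // sf_block)  # one scale byte per 16 elements
    sfb_bytes = tile_n * (tile_k // sf_block)
    return a_bytes + b_bytes + sfa_bytes + sfb_bytes


def max_stages(tile_m: int, tile_n: int, tile_k: int) -> int:
    """Upper bound on pipeline stages that fit in the shared-memory limit."""
    return SMEM_LIMIT_BYTES // nvfp4_stage_bytes(tile_m, tile_n, tile_k)


if __name__ == "__main__":
    print(nvfp4_stage_bytes(128, 128, 128))  # 18432
    print(max_stages(128, 128, 128))         # 5
```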

I think the community is running out of patience with the software support and deluge of vibe coded attempts/pull requests claiming to have fixed the underlying issue.

A clear roadmap/milestone confirmation would really do wonders.

I doubt we will get official support for cuda 13.2 and/or driver 595 on the Spark anytime soon, because driver 590 is still “in beta” for the Spark: Upgrading the GPU Driver from 580 to 590 on DGX Spark using CLI - #2 by aniculescu

I click on “Update Available [Update Now]” almost every single day, and I’m happy that updates are so frequent, but the stuff that needs fixing doesn’t seem to get fixed.
I’m still struggling with CUDA 13 / PyTorch support (TorchCodec / torchaudio issue on ARM64).

Drivers/Ubuntu and Ubuntu 26.04 are on our roadmap.

All frameworks are compatible with DGX Spark; we are working on the performance side.

Yes, you can see that in FlashInfer, CUTLASS, and FlashAttention we are adding support and exposing kernels that use NVFP4.

I am sure the whole community is pleased to read this, but (no personal offense intended!) the vagueness of the information on the timeline or roadmap for NVFP4 performance on the Blackwell GeForce series seems to be wearing down the last of many community members’ patience.

I think the extremely vague specification that the GB10 systems were sold and hyped under, for hardware that now turns out to be quite different from the datacenter Blackwell series, combined with the lack of an appropriately functioning software stack to match the sales hype, is (to be honest) slowly beginning to piss people off. (Especially the fact that the 5th-gen tensor cores are “the cheap 5th-gen cores”.)

No doubt the hardware is impressive and actually groundbreaking for the form factor and power requirements, but it would be fitting to speed up the process of maturing the software stack so that it matches the effort put into the sales hype.

Especially taking into consideration the massive delays, the chaos in distribution, and the failure to favor the early backers.

Nvidia is lucky that this did not create a regular shitstorm.

I love the hardware, expected the edge software stack, but not such a timeline, and most of all not the lack of clearly communicated information on the timeline for native software support.

I just really don’t understand the lack of official communication on the matter…

Hello, developers across frameworks now have access to DGX Spark and AGX Thor, so we are working with all of them to improve support.

tcgen05 for dummies - gau-nernst's blog (from one of the best devs in the community)

This is great but why link a blogpost that does a deep dive on instructions we can’t access on the spark?

I get the frustration here, but there’s a fundamental misunderstanding driving most of these issues.

SM12x (GB10 / DGX Spark / RTX 50) does not implement tcgen05, and therefore it also doesn’t support the associated FP4 Tensor Core path exposed through that ISA.

The current restrictions you’re seeing in CuTe (e.g. limiting tcgen05 ops to sm_100a / sm_110f families) are intentional: those ops map to hardware features that only exist on datacenter Blackwell (SM100/SM110), which includes:

  • tcgen05 MMA instructions

  • TMEM-backed accumulation model

  • blockscaled FP4 paths tied to that pipeline

SM12x is a different architecture target, and uses a different Tensor Core programming model. So mapping SM121 → SM100 just to “enable” tcgen05 would be incorrect and likely produce invalid codegen.

For DGX Spark / GB10 or sm120, FP4 support needs to go through the supported MMA pipelines for that architecture, not tcgen05. Example: cutlass/examples/python/CuTeDSL/blackwell_geforce/dense_gemm.py at main · NVIDIA/cutlass · GitHub
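As a minimal sketch of what this means for dispatch code, a kernel selector would branch on the compute-capability major version rather than remapping SM121 to SM100. The function name `select_fp4_path` and the returned path labels are hypothetical, and the major-version split (10 and 11 with tcgen05, 12 without) is an assumption taken from the statements in this thread, not from CUTLASS source.

```python
from typing import Tuple

# Assumption (from this thread): SM100/SM103/SM110 expose tcgen05, SM12x does not.
TCGEN05_MAJORS = (10, 11)
SM12X_MAJOR = 12


def select_fp4_path(capability: Tuple[int, int]) -> str:
    """Pick a hypothetical FP4 kernel family from a (major, minor) capability."""
    major, _minor = capability
    if major in TCGEN05_MAJORS:
        return "tcgen05"       # TMEM-backed, block-scaled tcgen05 MMA path
    if major == SM12X_MAJOR:
        return "sm12x_mma"     # the architecture's own MMA pipeline
    return "unsupported"       # no FP4 Tensor Core path assumed here


if __name__ == "__main__":
    print(select_fp4_path((10, 0)))  # tcgen05
    print(select_fp4_path((12, 1)))  # sm12x_mma (DGX Spark / GB10)
```

On a machine with PyTorch installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that can be passed straight into a selector like this.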

Happy to take a look at specific use cases if you’re trying to get FP4 working on SM121. From our side, we are working to fix the issues that @eugr and other community members are hitting across multiple frameworks.

OK, so why isn’t it currently available across at least TensorRT-LLM and vLLM? TensorRT-LLM is literally your runtime; I’d expect the full capabilities of the hardware to be realized.

“Happy to take a look at specific use cases if you’re trying to get FP4 working on SM121”

The use case is inference: I’d like inference to work on my NVIDIA inference box that’s running an NVIDIA quantization format with an NVIDIA model.

If the ‘supported MMA pipelines’ on sm120 and sm121 aren’t available via any software, then we’re not really using the hardware, or am I wrong?

I believe most of us are looking for concrete timelines for substantial improvements. The Spark was promised and sold as more capable than it proves to be right now. The community tries its very best to help close that gap wherever it can and is seeing some performance improvements, but these would benefit greatly from additional support from Nvidia.

If tcgen05 is officially off the table now: what can we expect from Nvidia to help make the Spark more capable?