I get the frustration here, but there’s a fundamental misunderstanding driving most of these issues.
SM12x (GB10 / DGX Spark / RTX 50) does not implement tcgen05, and therefore it also doesn’t support the associated FP4 Tensor Core path exposed through that ISA.
The current restrictions you’re seeing in CuTe (e.g. limiting tcgen05 ops to the sm_100a / sm_110f families) are intentional. Those ops map to hardware features that only exist on datacenter Blackwell (SM100/SM110), including:
- tcgen05 MMA instructions
- TMEM-backed accumulation model
- blockscaled FP4 paths tied to that pipeline
SM12x is a different architecture target, and uses a different Tensor Core programming model. So mapping SM121 → SM100 just to “enable” tcgen05 would be incorrect and likely produce invalid codegen.
For DGX Spark / GB10 or SM120, FP4 support needs to go through the MMA pipelines that architecture supports, not tcgen05. Example: cutlass/examples/python/CuTeDSL/blackwell_geforce/dense_gemm.py (main branch of NVIDIA/cutlass on GitHub)
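To make the dispatch point concrete, here is a minimal sketch of how a framework might pick a kernel family from the device's compute capability instead of aliasing SM121 to SM100. The function name `select_fp4_path` and the returned labels are hypothetical, made up for illustration; they are not CUTLASS or CuTe DSL APIs. The only assumption about hardware is what is stated above: tcgen05 on the SM100/SM110 families, a different MMA path on SM12x.

```python
def select_fp4_path(major: int, minor: int) -> str:
    """Pick an FP4 kernel family from a (major, minor) compute capability.

    Hypothetical helper for illustration only. The returned strings are
    made-up labels, not identifiers from CUTLASS or any other library.
    """
    if major in (10, 11):
        # SM100 / SM110 families: tcgen05 MMA with TMEM-backed accumulators.
        return "tcgen05"
    if major == 12:
        # SM120 / SM121 (RTX 50, GB10 / DGX Spark): no tcgen05, so use the
        # MMA path that this architecture actually supports.
        return "sm12x_mma"
    # Anything else: no FP4 path chosen here; the caller should fall back.
    return "unsupported"


# Example: GB10 reports compute capability 12.1.
print(select_fp4_path(12, 1))
```

In a PyTorch-based stack, the `(major, minor)` pair would typically come from `torch.cuda.get_device_capability()`. The point is that SM12x gets its own branch rather than being remapped to the SM100 one.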
Happy to take a look at specific use cases if you’re trying to get FP4 working on SM121. On our side, we are working to fix the issues that @eugr and other community members have run into across multiple frameworks.