PTX support for matrix shapes is larger than what is exposed via wmma::bmma_sync.
Here only 8x8x128 is supported, but Nvidia PTX has support for (and actually implements) more shapes, for example m16n8k256.
This would mean that Nvidia can finally expose more of the already existing instructions to wmma. All that needs to be done is to expose the fragment shapes and bmma_sync overloads in mma.h according to the ISA.
I have been waiting for this since 2022, but no update has exposed all the existing tensor core operations yet.
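For readers unfamiliar with the currently exposed path, here is a minimal sketch of the single 8x8x128 binary shape through wmma::bmma_sync. The kernel name `bmma_8x8x128` and the host helper `run_wmma_bmma_8x8x128` are made up for illustration, and the code assumes an sm_75 or newer GPU. The check uses all-ones A and all-zero B so that every XOR+POPC result must be 128 regardless of fragment layout.

```cuda
#include <cuda_runtime.h>
#include <mma.h>

namespace wmma = nvcuda::wmma;

// One warp computes an 8x8 int tile from 8x128-bit A (row major) and
// 128x8-bit B (column major) using XOR followed by population count.
__global__ void bmma_8x8x128(const unsigned *a, const unsigned *b, int *c) {
    using wmma::experimental::precision::b1;
    wmma::fragment<wmma::matrix_a, 8, 8, 128, b1, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 8, 8, 128, b1, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 8, 8, 128, int> c_frag;

    wmma::fill_fragment(c_frag, 0);
    wmma::load_matrix_sync(a_frag, a, 128);  // leading dimension in elements (bits)
    wmma::load_matrix_sync(b_frag, b, 128);
    wmma::bmma_sync(c_frag, a_frag, b_frag, c_frag,
                    wmma::experimental::bmmaBitOpXOR,
                    wmma::experimental::bmmaAccumulateOpPOPC);
    wmma::store_matrix_sync(c, c_frag, 8, wmma::mem_row_major);
}

// Hypothetical host helper: returns the number of wrong outputs, or -1 on a CUDA error.
int run_wmma_bmma_8x8x128() {
    const size_t ab_bytes = 8 * 128 / 8;       // 8 rows (or columns) of 128 bits
    const size_t c_bytes = 64 * sizeof(int);   // 8x8 int results
    unsigned *d_a = nullptr, *d_b = nullptr;
    int *d_c = nullptr;
    int h_c[64];
    if (cudaMalloc(&d_a, ab_bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_b, ab_bytes) != cudaSuccess) { cudaFree(d_a); return -1; }
    if (cudaMalloc(&d_c, c_bytes) != cudaSuccess) { cudaFree(d_a); cudaFree(d_b); return -1; }
    cudaMemset(d_a, 0xFF, ab_bytes);  // A: all bits set
    cudaMemset(d_b, 0x00, ab_bytes);  // B: all bits clear
    bmma_8x8x128<<<1, 32>>>(d_a, d_b, d_c);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_c, d_c, c_bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    if (err != cudaSuccess) return -1;
    int mismatches = 0;
    for (int i = 0; i < 64; ++i)
        if (h_c[i] != 128) ++mismatches;  // popc(ones XOR zeros) over 128 bits
    return mismatches;
}
```

Compile with something like `nvcc -arch=sm_75`. The larger PTX shapes have no equivalent fragment types in mma.h, which is the gap this post is about.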
Thanks for reaching out to us.
You are absolutely right that the PTX ISA exposes more matrix shapes (such as m16n8k256) than are currently available through the wmma::bmma_sync C++ API, and that in principle those could be mapped into additional fragment shapes and overloads in mma.h.
However, the direction of our tensor core programming model has changed since WMMA was first introduced:
On Hopper, the recommended high‑performance path is wgmma, which uses warp‑group level MMA (4 warps) rather than a single warp, and has a different programming model than WMMA.
On Blackwell, tensor cores are exposed via tcgen05, again with a different programming model (1–2 CTAs doing MMA).
Future architectures will continue to evolve in this direction, rather than extending the warp‑wide WMMA interface.
Because of that, WMMA is now considered a compatibility / fallback interface, not the main vehicle for exposing all tensor core capabilities or for reaching peak performance on newer GPUs. For Turing/Ampere as well, the lower‑level mma PTX instructions that operate directly on fragment layouts are the preferred way to fully control performance characteristics (e.g., avoiding shared‑memory bank conflicts) beyond what WMMA can express.
So while your request to add all PTX‑supported BMMA shapes to wmma::bmma_sync is technically reasonable, it does not align with how the tensor core programming model is evolving. The engineering effort is instead focused on the newer interfaces (wgmma, tcgen05, etc.) that match the hardware more closely and provide the best performance and flexibility going forward.
We really appreciate that you’ve been pushing on this since 2022 and understand the frustration of seeing PTX capabilities not fully surfaced at the WMMA level. Your feedback is valuable, and we’ll make sure it’s visible internally, but we also want to be transparent about why WMMA is unlikely to be expanded to cover every PTX tensor core shape.
I get it - as a customer of course I am not happy. SM86 had hardware instructions, like the other 2 shapes, that were never surfaced directly to any C or C++ header where they would be easily usable (I don't know why only half of the ISA was exposed back then), and now you won't do it since the focus is on SM100.
For anyone reading this answer, the summary is this - no, Nvidia won't implement it.
Customers have to do this if we want to use bmma ops in a kernel (which is slow on newer hardware)
I personally like to put those mma PTX instructions, inside their asm block, into a single __device__ __forceinline__ C++ function with all the input and output parameters provided as references.
I often use direct registers instead of arrays, but probably both approaches work.
So the actual code neither uses wmma, nor cutlass, nor asm blocks, but calls this inlined function with the asm block and the single mma instruction.
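As a rough sketch of that wrapper style, assuming the m16n8k256 XOR+POPC shape from the first post and an sm_80 or newer GPU: the function name `bmma_m16n8k256_xor_popc`, the kernel and the host helper below are made up for illustration and are not the poster's actual code. The operand counts (four A registers, two B registers, four accumulator registers per thread) follow the PTX ISA description of this shape. The smoke test uses all-ones A and all-zero B, so every output must be 256 independent of the fragment layout.

```cuda
#include <cuda_runtime.h>

// Single mma instruction wrapped in an inlined device function,
// with direct registers passed by reference instead of arrays.
__device__ __forceinline__ void bmma_m16n8k256_xor_popc(
    int &d0, int &d1, int &d2, int &d3,
    const unsigned &a0, const unsigned &a1, const unsigned &a2, const unsigned &a3,
    const unsigned &b0, const unsigned &b1,
    const int &c0, const int &c1, const int &c2, const int &c3)
{
    asm volatile(
        "mma.sync.aligned.m16n8k256.row.col.s32.b1.b1.s32.xor.popc "
        "{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%10, %11, %12, %13};\n"
        : "=r"(d0), "=r"(d1), "=r"(d2), "=r"(d3)
        : "r"(a0), "r"(a1), "r"(a2), "r"(a3),
          "r"(b0), "r"(b1),
          "r"(c0), "r"(c1), "r"(c2), "r"(c3));
}

// Calling code contains no wmma, no cutlass and no asm, only the wrapper call.
__global__ void bmma_smoke_kernel(int *out) {
    const unsigned a = 0xFFFFFFFFu;  // A: all bits set
    const unsigned b = 0u;           // B: all bits clear
    int d0, d1, d2, d3;
    bmma_m16n8k256_xor_popc(d0, d1, d2, d3, a, a, a, a, b, b, 0, 0, 0, 0);
    out[threadIdx.x * 4 + 0] = d0;
    out[threadIdx.x * 4 + 1] = d1;
    out[threadIdx.x * 4 + 2] = d2;
    out[threadIdx.x * 4 + 3] = d3;
}

// Hypothetical host helper: returns the number of wrong outputs, or -1 on a CUDA error.
int run_bmma_smoke_test() {
    int h_out[128];  // 16x8 results = 32 threads * 4 registers
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return -1;
    bmma_smoke_kernel<<<1, 32>>>(d_out);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;
    int mismatches = 0;
    for (int i = 0; i < 128; ++i)
        if (h_out[i] != 256) ++mismatches;  // popc(ones XOR zeros) over k = 256 bits
    return mismatches;
}
```

Compile with something like `nvcc -arch=sm_80`. A real kernel would still have to load the A, B and C registers according to the fragment layout documented in the PTX ISA, which this smoke test deliberately sidesteps.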
For advanced use cases, e.g. int16, int24, int32 matrix multiplications (which are solved by combining several int8), I also abstract them in this way.
The same goes for simulating sparse and wider matrix formats on older generations (like Turing).
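To illustrate the idea of building wider integer products from int8 pieces, here is a scalar sketch of one possible decomposition for int16. It is an assumption about how such a combination could work, not the poster's implementation, and the helper names `split_int16` and `dot_int16_via_int8` are hypothetical. Each int16 value x is written as hi*256 + lo with a signed high byte and an unsigned low byte, so one int16 dot product becomes four 8-bit dot products that are recombined with shifts.

```cuda
#include <cstdint>

// Split x so that x == hi * 256 + lo, with hi signed and lo unsigned.
__host__ __device__ inline void split_int16(int16_t x, int8_t &hi, uint8_t &lo) {
    lo = static_cast<uint8_t>(x & 0xFF);
    hi = static_cast<int8_t>(x >> 8);  // arithmetic shift keeps the sign
}

// Dot product of two int16 vectors using only 8-bit x 8-bit products.
// On tensor cores each of the four partial sums would map to one int8 MMA
// (with the matching signed/unsigned operand types); here they are plain loops.
__host__ __device__ inline int64_t dot_int16_via_int8(const int16_t *a,
                                                      const int16_t *b, int n) {
    int32_t hh = 0, hl = 0, lh = 0, ll = 0;  // 32-bit accumulators, as in int8 MMA
    for (int i = 0; i < n; ++i) {
        int8_t ah, bh;
        uint8_t al, bl;
        split_int16(a[i], ah, al);
        split_int16(b[i], bh, bl);
        hh += static_cast<int32_t>(ah) * bh;  // signed   x signed
        hl += static_cast<int32_t>(ah) * bl;  // signed   x unsigned
        lh += static_cast<int32_t>(al) * bh;  // unsigned x signed
        ll += static_cast<int32_t>(al) * bl;  // unsigned x unsigned
    }
    // (ah*256 + al) * (bh*256 + bl) = ah*bh*65536 + (ah*bl + al*bh)*256 + al*bl
    return static_cast<int64_t>(hh) * 65536
         + (static_cast<int64_t>(hl) + lh) * 256
         + ll;
}
```

The final recombination is done in 64 bits here to avoid overflow in the shifted partial sums; whether that is needed in practice depends on the value ranges and the vector length.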