WMMA vs. MMA

I’ve been exploring the performance characteristics of WMMA and MMA instructions, specifically for BF16 operations, and I’ve encountered some interesting results. The MMA instruction variant I’m focusing on is m16n8k16 for BF16, which also has a direct counterpart in WMMA.

From what I understand, the primary difference between WMMA and MMA seems to be that WMMA requires explicit fragment loading, whereas MMA does not. In my experiments, MMA appears to be slightly faster, but the reason for the performance gap is not entirely clear to me.

This raises a couple of questions:

  1. Why do both WMMA and MMA instructions exist if they appear to provide similar functionality? Are there specific scenarios or use cases where one is preferred over the other?

  2. What factors might explain the performance difference I’m observing? Is it related to the overhead of explicit fragment loading in WMMA, or are there other architectural nuances at play?

I’d greatly appreciate any insights or experiences others may have regarding these instructions. Understanding these differences could help optimize my workflows and clarify the best use cases for each.

Thanks in advance for your help!

One reason is that they serve slightly different needs. With a bit of study of what is exposed in CUDA C++, you will discover that it is all (PTX-equivalent) wmma functionality, along with corresponding matrix fragment loading. This “hides” the notion of PTX registers, which would probably be cumbersome to wrap in C++ clothing, but also limits what you can do.
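For illustration, the CUDA C++ wmma path for BF16 looks roughly like the following sketch (a 16x16x16 tile with unit leading dimensions; this is illustrative, not tested code — in real code the pointers would index into larger tiles):

```cuda
#include <cuda_bf16.h>
#include <mma.h>
using namespace nvcuda;

// Minimal wmma sketch (assumes sm_80 or newer for BF16).
// The fragment types hide the per-thread register layout entirely.
__global__ void wmma_bf16_tile(const __nv_bfloat16 *a, const __nv_bfloat16 *b,
                               float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __nv_bfloat16, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __nv_bfloat16, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::fill_fragment(fc, 0.0f);
    wmma::load_matrix_sync(fa, a, 16);   // the explicit fragment loads
    wmma::load_matrix_sync(fb, b, 16);   // mentioned in the question
    wmma::mma_sync(fc, fa, fb, fc);      // the actual TC operation
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}
```

Note that `load_matrix_sync` is a collective warp-wide operation: you hand it a pointer and a leading dimension, and the mapping of matrix elements to each thread's registers is opaque to you.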

The (PTX) mma functionality has no C++ counterpart, but is more flexible in its usage (mainly from a setup perspective) in that it exposes its register footprint directly. (It is possible to use PTX mma instructions in CUDA C++ via inline PTX, of course.)
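For comparison, the m16n8k16 BF16 mma via inline PTX looks roughly like this sketch. Here the per-thread register footprint (four packed `.b32` registers for A, two for B, four floats for C/D) is fully exposed, and getting the right matrix elements into those registers is entirely up to you:

```cuda
// Illustrative inline-PTX wrapper for mma.sync m16n8k16 BF16 (sm_80+).
// a[]/b[] hold packed pairs of bf16 values per 32-bit register; the
// caller must place elements per the PTX-documented fragment layout.
__device__ void mma_m16n8k16_bf16(float d[4], const unsigned a[4],
                                  const unsigned b[2], const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```

That exposed register footprint is what buys you the flexibility: you can stage data into those registers however you like (e.g. via `ldmatrix`, shared-memory loads, or computed in place), rather than being tied to the wmma fragment-load path.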

So mma functionality in my view is more “flexible” or perhaps “exposed”, whereas wmma probably serves the important purpose of making the TC functionality “directly” exposed/usable in CUDA C++ (via intrinsics, of course.)

Regarding performance, and since you:

  1. haven’t indicated the magnitude of the performance difference
  2. haven’t indicated how you are testing and measuring
  3. are explicitly asking about/mentioning fragment loading

I would first want to do a much closer exploration. I’d be very surprised if the actual wmma vs. corresponding mma SASS instruction had much variation in performance. So the place I would start is a test case/comparison that eliminates, as much as possible, data loading of any kind.
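One way to build such a comparison would be something like the following sketch: fill the fragments in registers, run a long dependent chain of TC ops, and time it with `clock64()`, so that no data loading sits inside the measured region (illustrative, untested code — the kernel name and loop count are arbitrary):

```cuda
#include <cuda_bf16.h>
#include <mma.h>
using namespace nvcuda;

// Sketch of a TC-only timing kernel: all operands live in registers,
// so the measured region contains (essentially) only mma_sync ops.
__global__ void tc_only_timing(float *out, long long *cycles) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __nv_bfloat16, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __nv_bfloat16, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
    wmma::fill_fragment(fa, __float2bfloat16(1.0f));
    wmma::fill_fragment(fb, __float2bfloat16(1.0f));
    wmma::fill_fragment(fc, 0.0f);

    long long t0 = clock64();
    for (int i = 0; i < 1024; ++i)      // dependent chain: fc feeds itself,
        wmma::mma_sync(fc, fa, fb, fc); // so the loop can't be optimized away
    long long t1 = clock64();

    if (threadIdx.x == 0) *cycles = t1 - t0;
    wmma::store_matrix_sync(out, fc, 16, wmma::mem_row_major); // keep fc live
}
```

The equivalent mma version would replace the loop body with the inline-PTX instruction operating on plain register arrays. Comparing the two this way, and inspecting the SASS of both with `cuobjdump -sass`, should tell you whether the TC instruction itself differs or whether the gap you saw comes from the surrounding data movement.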

I do think it is entirely possible that wmma-style fragment loading differs, from a performance perspective, from loading registers via some other approach. But with respect to the TC usage itself, I’d be surprised if there were much difference.

And for large-scale matrix-multiply operations, properly written code should be mostly compute-bound, not memory-bound, and I don’t think it’s a “close” comparison. If that is the case, small differences in data-loading efficiency should not have a marked impact on the overall large-scale, compute-bound matrix multiply. But if you are comparing a single mma SASS op to a single wmma SASS op, you are nowhere close to being compute-bound.

This may be of interest for some background.

The docs provide comparison info as well.
