WMMA vs. MMA

I’ve been exploring the performance characteristics of WMMA and MMA instructions, specifically for BF16 operations, and I’ve encountered some interesting results. The MMA instruction variant I’m focusing on is m16n8k16 for BF16, which also has a direct counterpart in WMMA.

From what I understand, the primary difference between WMMA and MMA seems to be that WMMA requires explicit fragment loading, whereas MMA does not. In my experiments, MMA appears to be slightly faster, but the reason for the performance gap is not entirely clear to me.

This raises a couple of questions:

  1. Why do both WMMA and MMA instructions exist if they appear to provide similar functionality? Are there specific scenarios or use cases where one is preferred over the other?

  2. What factors might explain the performance difference I’m observing? Is it related to the overhead of explicit fragment loading in WMMA, or are there other architectural nuances at play?

I’d greatly appreciate any insights or experiences others may have regarding these instructions. Understanding these differences could help optimize my workflows and clarify the best use cases for each.

Thanks in advance for your help!

One reason is that they serve slightly different needs. With a bit of study of what is exposed in CUDA C++, you will discover that it is all (PTX-equivalent) wmma functionality, along with corresponding matrix fragment loading. This “hides” the notion of PTX registers, which would probably be cumbersome to wrap in C++ clothing, but also limits what you can do.
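For illustration, the CUDA C++ wmma path for BF16 looks roughly like the following sketch (a 16x16x16 tile with unit leading dimensions; this is illustrative, not tested code — in real code the pointers would index into larger tiles):

```cuda
#include <cuda_bf16.h>
#include <mma.h>
using namespace nvcuda;

// Minimal wmma sketch (assumes sm_80 or newer for BF16).
// The fragment types hide the per-thread register layout entirely.
__global__ void wmma_bf16_tile(const __nv_bfloat16 *a, const __nv_bfloat16 *b,
                               float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __nv_bfloat16, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __nv_bfloat16, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::fill_fragment(fc, 0.0f);
    wmma::load_matrix_sync(fa, a, 16);   // the explicit fragment loads
    wmma::load_matrix_sync(fb, b, 16);   // mentioned in the question
    wmma::mma_sync(fc, fa, fb, fc);      // the actual TC operation
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}
```

Note that `load_matrix_sync` is a collective warp-wide operation: you hand it a pointer and a leading dimension, and the mapping of matrix elements to each thread's registers is opaque to you.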

The (PTX) mma functionality has no C++ counterpart, but is more flexible in its usage (mainly from a setup perspective) in that it exposes its register footprint directly. (It is possible to use PTX mma instructions in CUDA C++ via inline PTX, of course.)
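For comparison, the m16n8k16 BF16 mma via inline PTX looks roughly like this sketch. Here the per-thread register footprint (four packed `.b32` registers for A, two for B, four floats for C/D) is fully exposed, and getting the right matrix elements into those registers is entirely up to you:

```cuda
// Illustrative inline-PTX wrapper for mma.sync m16n8k16 BF16 (sm_80+).
// a[]/b[] hold packed pairs of bf16 values per 32-bit register; the
// caller must place elements per the PTX-documented fragment layout.
__device__ void mma_m16n8k16_bf16(float d[4], const unsigned a[4],
                                  const unsigned b[2], const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```

That exposed register footprint is what buys you the flexibility: you can stage data into those registers however you like (e.g. via `ldmatrix`, shared-memory loads, or computed in place), rather than being tied to the wmma fragment-load path.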

So mma functionality in my view is more “flexible” or perhaps “exposed”, whereas wmma probably serves the important purpose of making the TC functionality “directly” exposed/usable in CUDA C++ (via intrinsics, of course.)

Regarding performance, and since you:

  1. haven’t indicated the magnitude of the performance difference
  2. haven’t indicated how you are testing and measuring
  3. are explicitly asking about/mentioning fragment loading

I would first want to do a much closer exploration. I’d be very surprised if the actual wmma vs. corresponding mma SASS instruction had much variation in performance. So the place I would start is a test case/comparison that eliminates, as much as possible, data loading of any kind.
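One way to build such a comparison would be something like the following sketch: fill the fragments in registers, run a long dependent chain of TC ops, and time it with `clock64()`, so that no data loading sits inside the measured region (illustrative, untested code — the kernel name and loop count are arbitrary):

```cuda
#include <cuda_bf16.h>
#include <mma.h>
using namespace nvcuda;

// Sketch of a TC-only timing kernel: all operands live in registers,
// so the measured region contains (essentially) only mma_sync ops.
__global__ void tc_only_timing(float *out, long long *cycles) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __nv_bfloat16, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __nv_bfloat16, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
    wmma::fill_fragment(fa, __float2bfloat16(1.0f));
    wmma::fill_fragment(fb, __float2bfloat16(1.0f));
    wmma::fill_fragment(fc, 0.0f);

    long long t0 = clock64();
    for (int i = 0; i < 1024; ++i)      // dependent chain: fc feeds itself,
        wmma::mma_sync(fc, fa, fb, fc); // so the loop can't be optimized away
    long long t1 = clock64();

    if (threadIdx.x == 0) *cycles = t1 - t0;
    wmma::store_matrix_sync(out, fc, 16, wmma::mem_row_major); // keep fc live
}
```

The equivalent mma version would replace the loop body with the inline-PTX instruction operating on plain register arrays. Comparing the two this way, and inspecting the SASS of both with `cuobjdump -sass`, should tell you whether the TC instruction itself differs or whether the gap you saw comes from the surrounding data movement.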

I do think it is entirely possible that wmma-style fragment loading differs, from a performance perspective, from loading registers via some other approach. But with respect to the TC usage itself, I’d be surprised if there were much difference.

And for large-scale matrix-multiply operations, properly written code should be mostly compute-bound, not memory-bound, and I don’t think it’s a “close” comparison. If that is the case, small differences in data-loading efficiency should not have a marked impact on the overall large-scale, compute-bound matrix multiply. But if you are comparing a single mma SASS op to a single wmma SASS op, you are nowhere close to being compute-bound.

This may be of interest for some background.

The docs provide comparison info as well.
