WMMA - What does "warp matrix operations" mean?

I wanted to explore the tensor cores and use them to multiply some matrices generated in a kernel. Now I came across this line from the docs:

C++ warp matrix operations leverage Tensor Cores to accelerate matrix problems of the form D=AxB+C. These operations are supported on mixed-precision floating point data for devices of compute capability 7.0 or higher. This requires co-operation from all threads in a warp. In addition, these operations are allowed in conditional code only if the condition evaluates identically across the entire warp, otherwise the code execution is likely to hang.

This is probably very basic but I am still confused. What exactly does it mean? When I define the matrices for the multiplication in a CUDA kernel:

// (requires #include <mma.h> and using namespace nvcuda)
wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> matA;
wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> matB;
wmma::fragment<wmma::accumulator, 16, 16, 16, float> matAcc;
wmma::fill_fragment(matAcc, 0.0f);
// fill matA and matB with data
wmma::mma_sync(matAcc, matA, matB, matAcc);

My main questions are:

  1. Are these matrices shared across the whole warp, similar to the way warp-level primitives work?

  2. Where are the matrices' data actually stored? Shared memory? Thread registers (wouldn't that be too many)?

  3. I wanted to do this simple thing: each thread fills the matrices and then calls the multiplication. Will I get a unique result for each thread? Is this good practice? From what I understand, the multiplications will happen sequentially within the warp - so in 32 steps. Is that right?

Please excuse my lack of understanding, I haven’t found many sources or examples that would clearly answer these questions. Thank you for your time!

A matrix is a set of numbers. Let's imagine a 16x16 matrix, so 256 numbers. The operation is a “collective” operation done across the warp, so each thread in the warp participates. One of the forms of participation is that each thread in the warp is “responsible for providing” a subset of those 256 numbers. With 32 threads in the warp, each thread provides (for a 16x16 input matrix) 8 of the 256 numbers. Those 8 numbers are stored in registers “owned” by that thread, in a similar fashion to the way a thread stores and retrieves any other numbers. We are talking about SASS-level registers here, like R0, R1, R2, etc.

Likewise results/output data get stored in thread registers.

When preparing to issue the warp matrix multiply op/instruction, these data must be stored in thread level registers, as already described.

You will get a unique result per thread. Using our 16x16 example, each thread will have deposited in some set of its registers, 8 (unique) elements from the 256 that compose the resultant 16x16 matrix.
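To make the register residency concrete, here is a minimal sketch of a complete warp-level 16x16x16 multiply (the pointer names and the one-warp launch configuration are assumptions for illustration, not taken from the thread):

```cuda
#include <mma.h>
using namespace nvcuda;

// Minimal sketch: one warp computes a single 16x16 product D = A*B + C.
// a and b each point to 256 half elements; c points to 256 floats.
__global__ void wmma_16x16x16(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> accFrag;

    wmma::fill_fragment(accFrag, 0.0f);

    // Each of the 32 threads loads its own 8-element slice into registers.
    wmma::load_matrix_sync(aFrag, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(bFrag, b, 16);

    // One collective tensor-core multiply for the whole warp.
    wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);

    // Each thread writes back its 8 elements of the 256-element result.
    wmma::store_matrix_sync(c, accFrag, 16, wmma::mem_row_major);
}
// Launch with exactly one warp, e.g. wmma_16x16x16<<<1, 32>>>(dA, dB, dC);
```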

There is conceptually only 1 way to do a wmma op, so there is only one possible good practice if you intend to do a wmma op. What you do leading up to the op, and what you do after completing the op, can be the subject of care and exquisite programming, and thus we have a library like CUTLASS.

There isn’t anything on a GPU that is done sequentially across a warp. A wmma multiply op, like all instructions executed by the GPU, is issued warp-wide and executed warp-wide. The instruction will be issued to the tensor core unit that is associated with that SMSP (SM sub-partition) and will complete in some number of cycles, unspecified. You can certainly try to discover the latency, I doubt it happens to be 32. Most arithmetic latencies in a GPU are on the order of 5-10 cycles.

Other than the fact that this instruction effectively uses (let's say) 8 registers, this instruction is really a lot like any other instruction. Just like a FFMA instruction (fused FP32 multiply-add), the instruction:

  • is issued warp-wide
  • is issued to a particular functional unit in the SM
  • is issued at the same time (in the same clock cycle), across all threads in the warp
  • produces its result some number of cycles later
  • has a set of registers that it uses as input data for the op
  • has a set of registers that it uses for the output data from the op

A key difference is that the actual arithmetic performed produces results for a given thread that depend on data that originates in the registers of other threads. We would not make that statement about a FFMA instruction, where each arithmetic op across threads in the warp is fully independent. As a result, a FFMA instruction requires 32 FMA units to issue/complete, whereas a WMMA instruction requires a single TC unit.


Thank you very much for the thorough answer!

You will get a unique result per thread.
The operation is a “collective” operation done across the warp and so each thread in the warp participates…each thread in the warp is responsible for providing a subset of those 256 numbers.

So the warp-wide operation means that each thread can own unique matrices and perform unique multiplications, but the operation itself “internally” uses all the threads within the warp to drive the tensor cores and to read and write the results. In other words, as a programmer I don't have to care about the warp (unless aiming for specific optimizations). The matrices are not shared between threads from the code's point of view, and there is no need to, for example, fill one matrix with multiple threads - not even possible if it is not stored in shared memory. So basically it's possible to have 32 x 3 unique matrices per warp when computing MR = MA * MB + MR in each thread.

Do I get it right? Thanks again!

No, there is only one multiplication performed.

Yes, you do. The operation is undefined if not issued for an entire warp.

Yes, they are. Considered for a warp, there is a single 16x16 matrix multiply operation.

A x B = C

A is a matrix that has 256 elements. B is a matrix that has 256 elements. C is a matrix that has 256 elements. 8 of the elements belonging to A will be stored in registers associated with a particular thread in the warp. Another 8 elements belonging to A will be stored in registers associated with another particular thread in the same warp. And so on. And likewise for B, and for C.

The operation as a whole has no meaning when considered from the vantage point of a single thread. If you don’t have all 32 threads participating, and you have not divided up the elements of the A, B, and C matrices appropriately into registers of each of the 32 threads, then you’re doing it wrong and the results are undefined.

Most of what you have said is not the way I would say it.


Haha now I understand, I hope! Sorry for the trouble!

What you meant by “You will get a unique result per thread.” was not the whole result matrix but the subset, right?

So when I do this in the example:

 for (int i = 0; i < c_frag.num_elements; i++) ...

I am accessing the 8 values that are stored in the particular thread?

There is also this line:

The mapping between individual matrix parameters to their fragment(s) is opaque, so your program should not make assumptions about it.

So if I want to, let's say, use each thread to generate 8 values for the matrix at given positions, is the best way to first generate the matrices in shared memory and then use load_matrix_sync to load them into the fragments automatically?

At first I imagined that I would be able to determine which thread writes to which position of the matrix directly.

Yes, each thread is responsible for a non-overlapping subset of the A, B, and C matrices. The result per thread is unique, because it corresponds to a unique subset of the result matrix C.
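At the CUDA C++ level, that per-thread subset is exposed through the fragment's `x` array. A hedged sketch of the element-wise access pattern (here `c_frag` is assumed to be a loaded accumulator fragment and `alpha` a scaling factor, both illustrative):

```cuda
// Each thread touches only its own num_elements (8 in the 16x16 case)
// register-resident values; which matrix positions they correspond to
// is opaque at this level.
for (int i = 0; i < c_frag.num_elements; i++) {
    c_frag.x[i] = alpha * c_frag.x[i];  // e.g. scale this thread's fragment
}
```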

I don’t know which example you are referring to.

Yes, the origin of the A and B matrices is presumably some sort of “generally accessible space”. That could be global or shared memory. The threads would each load their respective fragments of A and B before the op. Since (at the CUDA C++ level) the matrix fragment is “opaque”, pretty much by definition at the CUDA C++ level, you do not know which of the 256 elements of A will end up in thread 0 registers, for example. Therefore, you would have to assume that the matrix is “generally accessible” and then let the fragment loaders do their thing, opaquely.
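A sketch of that pattern: generate the operands in shared memory with explicit, programmer-controlled indexing, then let the opaque loaders distribute the elements into fragments. The trivial generator values here are placeholders, not anything from the thread:

```cuda
#include <mma.h>
using namespace nvcuda;

__global__ void generate_then_multiply(float *c) {
    // Stage the full matrices in a "generally accessible space".
    __shared__ half sA[16 * 16];
    __shared__ half sB[16 * 16];

    // Each of the 32 threads generates 8 elements of A and B; here the
    // indexing is explicit and fully under our control.
    for (int i = threadIdx.x; i < 16 * 16; i += 32) {
        sA[i] = __float2half(1.0f);  // placeholder generator
        sB[i] = __float2half(1.0f);
    }
    __syncwarp();  // single warp; use __syncthreads() for larger blocks

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> accFrag;

    wmma::fill_fragment(accFrag, 0.0f);
    // The loaders decide, opaquely, which elements land in which thread.
    wmma::load_matrix_sync(aFrag, sA, 16);
    wmma::load_matrix_sync(bFrag, sB, 16);
    wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);
    wmma::store_matrix_sync(c, accFrag, 16, wmma::mem_row_major);
}
```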

It is somewhat less (in my view) “opaque” at the PTX level. For an example, see here. However since the mapping from virtual register to physical registers is “opaque” in PTX, there is still some “opacity”. However we can make observations about assignment of matrix elements to threads by studying PTX.


Sorry, I was referring to the example in the official blog post, but I guess it is so. Now I understand why the type is called fragment.

Thank you very much for the help. It is clear now.

It’s a pity that one has to do this double copy, though - storing to memory and then loading back into registers. The PTX level might help, but it would be nice to have a more direct zero-copy option.
