When I read the section on matrix fragments for mma with Tensor Cores, it said that “A warp executing mma.m8n8k4 with .f16 floating point type will compute 4 MMA operations of shape .m8n8k4.”
First, I do not know why it should be 4 MMA operations of shape .m8n8k4.
Second, in the “MMA computation 1” part of the document, the registers of thread T0 hold the 0th row of the A matrix, and T0 also holds the first 4 elements of the first row of the B matrix. According to the matrix calculation, a0 in T0 should only be multiplied by b0, b1, b2, b3 in T0, so do a1, a2, a3 not participate in computation 1? How is MMA computation 1 calculated in detail?
The TC unit of cc7.0 devices was designed to have that kind of hardware behavior, for at least one of the supported hardware paths. “Why is it that way?” questions can be difficult to answer. You will find with a bit of searching that the performance characteristic of that particular tensor op form varies significantly depending on GPU arch, so it seems clear to me that the GPU designers had different ideas as GPU development and TC development progressed.
The calculation performs an m8n8k4 matrix-matrix multiply (four of them, in fact). A matrix multiply has only one correct result, so the outcome is fully defined. Beyond that, I don’t know of any detailed specifications for TC unit behavior.
Here are some questions that may be of interest: 1, 2, 3
Thank you for your reply. But I think that if the document makes a statement like this, it needs to explain it clearly; otherwise the document no longer makes sense.
It seems to me that matrices A and B are loaded by all 4 quad pairs, which results in duplicated (and unnecessary) data loads.
Also, the 4 quad pairs seem to calculate the exact same 8x8 result, which is also confusing. Will only one result of the 4 quad pairs be kept and the other three be discarded?
The only guess I can think of is that it might be hard to add “mask” logic for a Tensor Core instruction at warp level to keep 3/4 of the warp idle, so it sacrifices efficiency for simplicity.
There was an academic (3rd party) paper analyzing the m8n8k4 behavior with what each of the 4 steps does.
Is it still that relevant now? The cc 7.0 Tensor Cores are rather outdated, and newer architectures support this matrix size in a slower compatibility mode.
But based on the figure in 9.7.15.4.1. Matrix Fragments for mma.m8n8k4 with .f16 floating point type, each of the four MMA operations loads an 8x4 matrix A and a 4x8 matrix B. This is exactly the shape of the source matrices A and B, so it looks to me as if each MMA operation loads the full matrices A and B.
How can it calculate 4 separate parts if it loads the full (and same) matrices A and B?
Those 4 mma operation register layouts correspond to 4 separate A and B matrices of the appropriate shape(s). They are not all referring to the same A and B matrices.
Thank you for your explanation. Is there any figure that illustrates the fragment-to-thread mapping?
I still couldn’t get the mapping right:
(1) Source Matrix A is of shape 8x4
(2) For MMA operation 1, each thread of a QP loads four elements, which together also form a shape of 8x4
(3) For MMA 2/3/4, same as (2)
How does each QP load a separate part of A and B?
Besides, if we calculate the FLOPs, an 8x8x4 operation requires 512 FMA, but calculating 4 of them needs 2048, where does the increased FLOPs come from?
I’ve read the document again and drew a figure to illustrate how the fragments map to each QP.
Let’s suppose A is row-major with shape 8x4, and B is column-major with shape 4x8.
The row and column of an A matrix fragment element can be computed as:
row = %laneid % 4          if %laneid < 16
      (%laneid % 4) + 4    otherwise
col = i                    for ai where i = {0,..,3}
The rule for B is analogous. The row and column of a B matrix fragment element can be computed as:
row = i                    for bi where i = {0,..,3}
col = %laneid % 4          if %laneid < 16
      (%laneid % 4) + 4    otherwise
Therefore, we get the following figure, in which the two quads of each QP load the upper half and the lower half of matrices A and B. Together, each QP loads a full matrix A and a full matrix B.
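The two rules quoted above can be checked with a few lines of host-side Python. This is only a sketch, not GPU code: the function names are made up for illustration, and the lane list for the first quad pair (lanes 0-3 and 16-19) is an assumption taken from the PTX guide's description of which threads hold MMA computation 1.

```python
# Host-side sketch of the m8n8k4 .f16 fragment rules quoted above.
# Function names are illustrative; nothing here runs on a GPU.

def a_fragment_coords(laneid, i):
    """(row, col) of register element a_i in a row-major 8x4 A."""
    row = laneid % 4 if laneid < 16 else (laneid % 4) + 4
    return row, i

def b_fragment_coords(laneid, i):
    """(row, col) of register element b_i in a column-major 4x8 B."""
    col = laneid % 4 if laneid < 16 else (laneid % 4) + 4
    return i, col

# Assumption: the first quad pair consists of lanes 0-3 and 16-19.
qp0_lanes = [0, 1, 2, 3, 16, 17, 18, 19]

# Collect every matrix cell touched by the 8 lanes x 4 registers.
a_cells = {a_fragment_coords(lane, i) for lane in qp0_lanes for i in range(4)}
b_cells = {b_fragment_coords(lane, i) for lane in qp0_lanes for i in range(4)}

# The 32 registers of one quad pair cover each cell of A and B exactly once.
assert a_cells == {(r, c) for r in range(8) for c in range(4)}
assert b_cells == {(r, c) for r in range(4) for c in range(8)}

for lane in qp0_lanes:
    print(lane, [a_fragment_coords(lane, i) for i in range(4)])
```

The rules depend only on %laneid % 4 and on whether the lane is below 16, so the same check passes for lanes 4-7 and 20-23, and so on. Each quad pair's 8 lanes therefore describe a complete 8x4 A and 4x8 B held in their own registers.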
Perhaps this example will help. First, let’s point out that the PTX instruction (mma.sync.aligned.m8n8k4.row.col.f16.f16.f16.f16) does not actually load data from global or shared memory. It operates on registers. The diagrams (e.g. Figures 22-26 in the PTX guide) represent the mapping of registers (in each thread) to elements of A, B, and C.
It is as if the operation being performed is:
C1 = A1xB1 (handled by threads 0-3, and 16-19)
C2 = A2xB2 (handled by threads 4-7, and 20-23)
C3 = A3xB3 (handled by threads 8-11, and 24-27)
C4 = A4xB4 (handled by threads 12-15, and 28-31)
We will call the above items “computations” 1, 2, 3, and 4 (to be consistent with PTX nomenclature).
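The thread ranges in the list above follow a simple pattern, which a short host-side Python sketch can make explicit. The helper name `computation_of` and the formula inside it are my own reading of that list, not something the PTX guide states in this form.

```python
# Sketch: which of the 4 computations a given lane belongs to.
# The formula is inferred from the thread ranges listed above.

def computation_of(laneid):
    """Return the computation number (1..4) that lane `laneid` contributes to."""
    return (laneid % 16) // 4 + 1

# Group the 32 lanes of a warp by computation number.
lanes_by_computation = {
    n: [lane for lane in range(32) if computation_of(lane) == n]
    for n in range(1, 5)
}

for n, lanes in lanes_by_computation.items():
    print(f"C{n} = A{n}xB{n}: lanes {lanes}")
```

Each computation ends up with exactly 8 lanes: one group of 4 from the lower half of the warp and the matching group of 4 from the upper half.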
The code has an if-sequence that loads the registers for each thread, broken down by computation number. For computation 1, we load the A1 matrix with values of 1.0. For computation 2, we load the A2 matrix with values of 2.0, and likewise for computations 3 and 4. For all computations, we load the Bn matrix with 1.0. Therefore we expect results of all 4 for C1, all 8 for C2, all 12 for C3, and all 16 for C4.
Rather than loading register values directly, we could have loaded the registers from any shared memory location or any global memory location. This means that computations 1 through 4 work on independent data sources. This is most directly evident from the register footprint, but it is also true when we consider that we could have loaded these registers from anywhere.
To answer your question from this thread: the SASS dump output shows that the PTX mma instruction for m8n8k4 compiles to two SASS instructions (in the sm_70 case, anyway), one labelled part0 and the other labelled part1. Since it is now evident that these two SASS instructions are somehow performing all 4 independent computations, we must conclude that those 2 SASS instructions between them provide enough FMA ops for all 4 computations. You could paste this code into godbolt to witness the same thing.
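The expected values mentioned above (all 4 for C1, all 8 for C2, all 12 for C3, all 16 for C4) can be reproduced with a small host-side emulation. To be clear, this is a plain Python sketch and not the CUDA example discussed in this post; it only mirrors the arithmetic, with A_n filled with n and B_n filled with 1.0, and the `matmul` helper is written here just for illustration.

```python
# Host-side emulation of four independent m8n8k4 multiplies.
# Not the CUDA example from this thread; it only checks the arithmetic.

M, N, K = 8, 8, 4  # shape m8n8k4: A is MxK, B is KxN, C is MxN

def matmul(a, b):
    """Naive matrix multiply of a (rows x inner) by b (inner x cols)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[r][k] * b[k][c] for k in range(inner)) for c in range(cols)]
        for r in range(rows)
    ]

results = {}
for n in range(1, 5):
    a_n = [[float(n)] * K for _ in range(M)]  # A_n: every element is n
    b_n = [[1.0] * N for _ in range(K)]       # B_n: every element is 1.0
    results[n] = matmul(a_n, b_n)             # C_n = A_n x B_n

for n, c_n in results.items():
    # Every element of C_n is a sum of K products n * 1.0, i.e. 4n.
    assert all(value == 4.0 * n for row in c_n for value in row)
    print(f"C{n}: all elements equal {c_n[0][0]}")
```

Because each A_n holds different values, the four results differ, which is only possible if the four computations really do work on separate operand matrices.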
I think you are getting confused between FLOPs and FMA. An FMA constitutes two FLOPs. For m8n8k4 matrix-multiply, the number of required FMA operations is 8x8x4 = 256. The number of required FLOPs is 512.
Since the 4 computations require 256 FMA or 512 FLOPs each, we must conclude that the two-instruction SASS sequence (part0/part1) provides the necessary FLOPs, i.e. 4x512 = 2048. Beyond that, I’m not sure how to answer the question “where does the increased FLOPs come from?” The FLOPs are provided by the Tensor Core unit. Since the code provably produces the correct results, we must conclude that the two-instruction SASS sequence is enough to provide those FLOPs.
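The FMA versus FLOP bookkeeping above is easy to mix up, so here it is spelled out as a short Python sketch. The variable names are mine; the only inputs are the m8n8k4 shape, the count of 4 computations per warp-level instruction, and the convention that one FMA counts as two FLOPs.

```python
# Arithmetic sketch: FMA and FLOP counts for mma.m8n8k4 with .f16.

M, N, K = 8, 8, 4   # shape m8n8k4
COMPUTATIONS = 4    # independent MMA computations per warp-level instruction
FLOPS_PER_FMA = 2   # one multiply plus one add

fma_per_computation = M * N * K                              # 8 * 8 * 4
flops_per_computation = FLOPS_PER_FMA * fma_per_computation
total_fma = COMPUTATIONS * fma_per_computation
total_flops = COMPUTATIONS * flops_per_computation

print("FMA per computation:  ", fma_per_computation)
print("FLOPs per computation:", flops_per_computation)
print("FMA per instruction:  ", total_fma)
print("FLOPs per instruction:", total_flops)
```

So the figure of 512 applies to FLOPs for a single 8x8x4 multiply, and 2048 is the FLOP count for all four computations together.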