Hi,
We are porting some algorithms to the Tensor Core BMMA instructions.
The tensor core mma fragments have a specific layout in registers, documented for example here.
The documentation states that the A fragments are placed in registers as (quoting): “A vector expression containing four .b32 registers”.
I could not find any details about vector expressions, e.g. what the required properties are.
→ Question 1: Is there any documentation on what a “vector expression” is?
To explain the motivation, here are some details:
As far as I understand, the matrix fragments for the 16x8x256 mma need to be loaded by hand. That works for us, and we get the expected results. However, the documented fragment layout is a bit unfortunate for our later processing, so we would like to permute the input and output registers in order to avoid a separate redistribution of the outputs across the threads. Right now the results take a round trip through shared memory.
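For context, our current (unpermuted) call with the operand order exactly as documented looks roughly like the sketch below; the helper name and the array types are just our own convention:

// Sketch of the baseline m16n8k256 BMMA call with the documented operand order.
// Per thread: a[4] and b[2] hold the b1 fragments packed into .b32 registers,
// c[4] and d[4] are the s32 accumulators.
__device__ void bmma_m16n8k256_xor_popc(int d[4], const unsigned a[4],
                                        const unsigned b[2], const int c[4]) {
    asm("mma.sync.aligned.m16n8k256.row.col.s32.b1.b1.s32.xor.popc "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=r"(d[0]), "=r"(d[1]), "=r"(d[2]), "=r"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "r"(c[0]), "r"(c[1]), "r"(c[2]), "r"(c[3]));
}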
→ Question 2: Given the example from this post, could we do e.g. the following:
asm("mma.sync.aligned.m16n8k256.row.col.s32.b1.b1.s32.xor.popc "
    "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
    : "=r"(D[2]), "=r"(D[3]), "=r"(D[0]), "=r"(D[1])  // 0 1 2 3 -> 2 3 0 1
    : "r"(A[3]), "r"(A[2]), "r"(A[1]), "r"(A[0]),     // 0 1 2 3 -> 3 2 1 0
      "r"(B[0]), "r"(B[1]),
      "r"(C[3]), "r"(C[1]), "r"(C[2]), "r"(C[0]));    // 0 1 2 3 -> 3 1 2 0
It seems to work; however, I am wary of correctness and performance implications. Neither seems to be explained in the documentation.
(Bonus question: Could one go as far as binding the same all-zero register to all four C operands? That would save three registers per thread.)
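To make the bonus question concrete, what I have in mind is roughly the following; “zero” is just an illustrative name for a single zeroed register:

// Sketch of the bonus idea: bind one zeroed register to all four C operands.
// Whether ptxas actually keeps this as a single register (and whether it is
// even allowed) is exactly what I am unsure about.
int zero = 0;
asm("mma.sync.aligned.m16n8k256.row.col.s32.b1.b1.s32.xor.popc "
    "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
    : "=r"(D[0]), "=r"(D[1]), "=r"(D[2]), "=r"(D[3])
    : "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
      "r"(B[0]), "r"(B[1]),
      "r"(zero), "r"(zero), "r"(zero), "r"(zero));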