Using registers to store mma calculation results

The mma operation is recommended for calculating D = A * B + C in fp16. However, many of the tutorials on how to use mma just use shared memory for everything. I wonder if there are some tutorials that use registers to store C & D through the whole process?

The CUDA programming guide only shows how to fill in the PTX operands with registers,
e.g.:
.reg .f16x2 %Ra<4>, %Rb<2>, %Rc<2>, %Rd<2>;
mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16
{%Rd0, %Rd1},
{%Ra0, %Ra1, %Ra2, %Ra3},
{%Rb0, %Rb1},
{%Rc0, %Rc1};

However, as a register has 32 bits, I believe Rc & Rd can be stored in a more space-saving way. Can anyone give me a real code piece showing how to store C & D in registers?
For example, if I allocate 32 registers per thread for Rc & Rd, they ought to hold 64 fp16 values. I need to dynamically choose some of them for the mma operation in every calculation round. What should I do?

Should I use the __half2 type to save the values in registers?
e.g.:
__global__ void f() {
__half2 Rc[32];
__half2 Rd[32];

}

Will the compiler automatically coalesce the registers in order to use fewer registers? For example, in your code, you define “half a[4]”. Will it use 2 registers or 4 registers in the end?

Do I need to coalesce the registers by myself?

I’m not going to use the word “coalesce” here, although I know what you mean, I think. You are interested in arguments being packed in a sensible fashion into registers.

I would suggest (maybe others will disagree with me) that these questions cannot be fully answered or are not best answered by studying PTX. To be conclusive, you should have some familiarity with SASS. You can find many forum posts that discuss how to work with SASS. Even the word “registers” has a slightly different meaning in PTX than in SASS - they are not precisely the same idea.

However I’m going to immediately contradict myself. If you study the code carefully, you will note that all the 16-bit PTX “registers” and arguments, when presented to the PTX mma instruction, are actually already combined via curly braces. Those indicate a vector notation in PTX, which includes implications for “storage”. Furthermore, you can even see the combination at the CUDA C++ level - all of the data are reinterpreted into unsigned quantities, before being fed as arguments to the PTX mma instruction. That’s a 32-bit quantity in CUDA C++, and when it manifests in a register, it always manifests as a 32-bit register. The 16-bit quantities are already combined into 32-bit quantities, at the CUDA C++ level, before you get to PTX or SASS. You can use your knowledge of C++ to discover and decide the storage order of 16 bit quantities.
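For illustration, here is a minimal sketch of that packing at the CUDA C++ level (the helper name is mine, not from the guide): two fp16 values placed in a __half2 already occupy one 32-bit quantity, which is exactly what a PTX "r" operand receives.

#include <cuda_fp16.h>

// Two fp16 values packed into a single 32-bit quantity at the CUDA C++
// level, before any PTX is involved; the first value sits in the low
// 16 bits, the second in the high 16 bits.
__device__ unsigned pack_halves(half lo, half hi) {
    __half2 v = __halves2half2(lo, hi);
    return *reinterpret_cast<unsigned *>(&v);   // one 32-bit register's worth
}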

m16n8k16 has C and D of matrix size m x n = 16 * 8 = 128 values each, not 64 values each.

That is 2 32-bit registers, each holding 2 16-bit values, over 32 threads (or lanes) per warp: 2 × 2 × 32 = 128 values for each of C and D.

C and D are distributed over all 32 threads. You would not allocate a 32-register array or a 64-register array or a 128-register array for that matter, but just use two 32-bit registers (e.g. __half2) for C and two 32-bit registers for D.
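As a sketch, the per-thread storage for that shape (assuming the all-fp16 m16n8k16 variant quoted above) would be just:

__half2 Rc[2];   // this thread's 4 fp16 values of C (two 32-bit registers)
__half2 Rd[2];   // this thread's 4 fp16 values of D (two 32-bit registers)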

Thanks!

Another question: if the calculation is C = A * B + C, can I just write the code this way?
float const *C = reinterpret_cast<float const *>(&c);
float *D = reinterpret_cast<float *>(&c);

I don’t think that will work, did you try it?

Types matter, of course, and this includes when passing data into PTX registers. But you’ve not provided a complete code to demonstrate the change, so I can’t be certain of what you are suggesting. When PTX wants a u32 register, a floating point quantity generally will not suffice.

Like this:

__global__ void mma_fp16_acc_fp32(float *out) {
    float c[4] = {0., 0., 0., 0.};
    float d[4] = {0., 0., 0., 0.};
    half a[4] = {1., 1., 1., 1.};
    half b[2] = {1., 1.};
    // the above would set our input matrices to all 1
    // now lets modify some values
    if (threadIdx.x % 4 == 0) {
      // set the first column of A to be 0, 1, 2, 3, ... 15
      a[0] = threadIdx.x / 4; a[2] = threadIdx.x / 4 + 8;
      // set the second row of B to 3, 3, 3, ... 3
      b[1] = 3;
    }
    unsigned const *A = reinterpret_cast<unsigned const *>(&a);
    unsigned const *B = reinterpret_cast<unsigned const *>(&b);
    float const *C = reinterpret_cast<float const *>(&c);
    float *D = reinterpret_cast<float  *>(&c);
    asm(
      "mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 "
      "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
      : "=f"(D[0]), "=f"(D[1]), "=f"(D[2]), "=f"(D[3])
      :
        "r"(A[0]), "r"(A[1]),
        "r"(B[0]),
        "f"(C[0]), "f"(C[1]), "f"(C[2]), "f"(C[3])
    );
    memcpy(out+threadIdx.x*2, D, 8);
    memcpy(out+8*8+threadIdx.x*2, D+2, 8);
}

I just want to calc C = A * B + C

OK I missed the D/C overlap, sorry.

Yes, D and C can refer to the same matrix. From here:

where D and C are called accumulators and may refer to the same matrix.

Thanks!

If you are using f16 for C and D you should cast to half2 instead of float.

If you are using f32 for C and D you should cast to float instead of half.
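To make that concrete, here is a minimal, untested sketch (not from the thread; the kernel name is mine, and it assumes an sm_80 or newer target) of the all-fp16 m16n8k16 variant quoted at the top of the thread. C and D live in two __half2 per thread and are passed through "r" constraints after a reinterpret to unsigned, whereas the f32-accumulator kernel above keeps them as float and uses "f" constraints.

#include <cuda_fp16.h>

__global__ void mma_fp16_acc_fp16(float *out) {
    // Per-thread fragments for one m16n8k16 tile, all-fp16 variant:
    // A = 8 halves (4 packed registers), B = 4 halves (2 packed registers),
    // C and D = 4 halves each (2 packed registers each).
    half    a[8];
    half    b[4];
    __half2 c2[2], d2[2];
    for (int i = 0; i < 8; ++i) a[i] = __float2half(1.f);
    for (int i = 0; i < 4; ++i) b[i] = __float2half(1.f);
    c2[0] = c2[1] = __float2half2_rn(0.f);   // zero the accumulator

    unsigned const *A = reinterpret_cast<unsigned const *>(&a);
    unsigned const *B = reinterpret_cast<unsigned const *>(&b);
    unsigned const *C = reinterpret_cast<unsigned const *>(&c2);
    unsigned       *D = reinterpret_cast<unsigned *>(&d2);
    asm(
      "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
      "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
      : "=r"(D[0]), "=r"(D[1])
      :
        "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
        "r"(B[0]), "r"(B[1]),
        "r"(C[0]), "r"(C[1])
    );
    // d2[0] and d2[1] now hold this thread's 4 fp16 results of D = A*B + C.
    out[threadIdx.x] = __half2float(__low2half(d2[0]));
}

Relative to the f32 kernel above, only the accumulator type and the constraint letters change; the reinterpret-cast pattern is the same.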

I still have one question. If I want to declare a and b as __half2 to reduce register usage (because I need many half values to be stored in registers), can I still write the kernel in this way? Like below:

__global__ void mma_fp16_acc_fp32(float *out) {
    float c[4] = {0., 0., 0., 0.};
    float d[4] = {0., 0., 0., 0.};
    __half2 a[32] = {...}; // any initial value
    __half2 b[16] = {...}; // any initial value
    unsigned const *A = reinterpret_cast<unsigned const *>(&a);
    unsigned const *B = reinterpret_cast<unsigned const *>(&b);
    float const *C = reinterpret_cast<float const *>(&c);
    float *D = reinterpret_cast<float  *>(&c);
    // then I declare and assign values to the integers tmp_0, tmp_1, tmp_2 depending on threadIdx
    asm(
      "mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 "
      "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
      : "=f"(D[0]), "=f"(D[1]), "=f"(D[2]), "=f"(D[3])
      :
        "r"(A[tmp_0]), "r"(A[tmp_1]),
        "r"(B[tmp_2]),
        "f"(C[0]), "f"(C[1]), "f"(C[2]), "f"(C[3])
    );
    memcpy(out+threadIdx.x*2, D, 8);
    memcpy(out+8*8+threadIdx.x*2, D+2, 8);
}

Indexing A and B with the runtime values tmp_0 / tmp_1 / tmp_2 will lead to the use of local memory instead of registers. Try to fill the initial values correctly depending on the thread, instead of giving the same data to each thread and then using indices later on.
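A minimal sketch of that suggestion (the kernel name and the fill pattern are illustrative, not from the thread): keep only this thread's fragment in fixed-index variables derived from threadIdx, so no runtime-indexed array is needed and nothing spills to local memory.

#include <cuda_fp16.h>

__global__ void mma_fp16_acc_fp32_per_thread(float *out) {
    float c[4] = {0.f, 0.f, 0.f, 0.f};   // this thread's C/D fragment
    half  a[4], b[2];                    // this thread's A/B fragment only
    int lane = threadIdx.x % 32;

    // Fill the fragment directly from the lane index (illustrative values),
    // instead of indexing a large __half2 array with runtime indices.
    #pragma unroll
    for (int i = 0; i < 4; ++i) a[i] = __int2half_rn(lane);
    b[0] = __float2half(1.f);
    b[1] = __float2half(1.f);

    unsigned const *A = reinterpret_cast<unsigned const *>(&a);
    unsigned const *B = reinterpret_cast<unsigned const *>(&b);
    float const *C = reinterpret_cast<float const *>(&c);
    float *D = reinterpret_cast<float *>(&c);
    asm(
      "mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 "
      "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
      : "=f"(D[0]), "=f"(D[1]), "=f"(D[2]), "=f"(D[3])
      :
        "r"(A[0]), "r"(A[1]),   // fixed indices stay in registers
        "r"(B[0]),
        "f"(C[0]), "f"(C[1]), "f"(C[2]), "f"(C[3])
    );
    memcpy(out + threadIdx.x * 2, D, 8);
    memcpy(out + 8 * 8 + threadIdx.x * 2, D + 2, 8);
}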
