Using registers to store mma calculation results

The mma operation is recommended for calculating D = A * B + C in fp16. However, many of the tutorials on how to use mma just use shared memory for everything. I wonder if there are some tutorials that use registers to store C & D through the whole process?

The CUDA programming guide only shows how to fill in the PTX operands with registers,
e.g.:
.reg .f16x2 %Ra<4>, %Rb<2>, %Rc<2>, %Rd<2>;
mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16
{%Rd0, %Rd1},
{%Ra0, %Ra1, %Ra2, %Ra3},
{%Rb0, %Rb1},
{%Rc0, %Rc1};

However, as a register has 32 bits, I believe Rc & Rd can be stored in a more space-saving way. Can anyone give me a real code piece showing how to store C & D in registers?
For example, if I allocate 32 registers per thread for Rc & Rd, they ought to hold 64 fp16 values. I need to dynamically choose some of them for the mma operation in every calculation round. What should I do?

Should I use the __half2 type to save the values in registers?
e.g.:
__global__ void f() {
__half2 Rc[32];
__half2 Rd[32];

}

Will the compiler automatically coalesce the registers in order to use fewer registers? For example, in your code, you define “half a[4]”. Will it use 2 registers or 4 registers in the end?

Do I need to coalesce the registers by myself?

I’m not going to use the word “coalesce” here, although I know what you mean, I think. You are interested in arguments being packed in a sensible fashion into registers.

I would suggest (maybe others will disagree with me) that these questions cannot be fully answered or are not best answered by studying PTX. To be conclusive, you should have some familiarity with SASS. You can find many forum posts that discuss how to work with SASS. Even the word “registers” has a slightly different meaning in PTX than in SASS - they are not precisely the same idea.

However I’m going to immediately contradict myself. If you study the code carefully, you will note that all the 16-bit PTX “registers” and arguments, when presented to the PTX mma instruction, are actually already combined via curly braces. Those indicate a vector notation in PTX, which includes implications for “storage”. Furthermore, you can even see the combination at the CUDA C++ level - all of the data are reinterpreted into unsigned quantities, before being fed as arguments to the PTX mma instruction. That’s a 32-bit quantity in CUDA C++, and when it manifests in a register, it always manifests as a 32-bit register. The 16-bit quantities are already combined into 32-bit quantities, at the CUDA C++ level, before you get to PTX or SASS. You can use your knowledge of C++ to discover and decide the storage order of 16 bit quantities.
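For illustration, here is a minimal sketch of that packing at the CUDA C++ level (the helper name is mine, not from the guide): two fp16 values placed in a __half2 already occupy one 32-bit quantity, which is exactly what a PTX "r" operand receives.

#include <cuda_fp16.h>

// Two fp16 values packed into a single 32-bit quantity at the CUDA C++
// level, before any PTX is involved; the first value sits in the low
// 16 bits, the second in the high 16 bits.
__device__ unsigned pack_halves(half lo, half hi) {
    __half2 v = __halves2half2(lo, hi);
    return *reinterpret_cast<unsigned *>(&v);   // one 32-bit register's worth
}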

m16n8k16 has C and D of matrix size m x n = 16 * 8 = 128 values each, not 64 values each.

That is 2 32-bit registers, each holding 2 16-bit values, over 32 threads (or lanes) per warp: 2 × 2 × 32 = 128 values for each of C and D.

C and D are distributed over all 32 threads. You would not allocate a 32-register array or a 64-register array or a 128-register array for that matter, but just use two 32-bit registers (e.g. __half2) for C and two 32-bit registers for D.
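As a sketch, the per-thread storage for that shape (assuming the all-fp16 m16n8k16 variant quoted above) would be just:

__half2 Rc[2];   // this thread's 4 fp16 values of C (two 32-bit registers)
__half2 Rd[2];   // this thread's 4 fp16 values of D (two 32-bit registers)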

Thanks!

Another question: if the calculation is C = A * B + C, can I just write the code this way?
float const *C = reinterpret_cast<float const *>(&c);
float *D = reinterpret_cast<float *>(&c);

I don’t think that will work, did you try it?

Types matter, of course, and this includes when passing data into PTX registers. But you’ve not provided a complete code to demonstrate the change, so I can’t be certain of what you are suggesting. When PTX wants a u32 register, a floating point quantity generally will not suffice.

Like this:

__global__ void mma_fp16_acc_fp32(float *out) {
    float c[4] = {0., 0., 0., 0.};
    float d[4] = {0., 0., 0., 0.};
    half a[4] = {1., 1., 1., 1.};
    half b[2] = {1., 1.};
    // the above would set our input matrices to all 1
    // now lets modify some values
    if (threadIdx.x % 4 == 0) {
      // set the first column of A to be 0, 1, 2, 3, ... 15
      a[0] = threadIdx.x / 4; a[2] = threadIdx.x / 4 + 8;
      // set the second row of B to 3, 3, 3, ... 3
      b[1] = 3;
    }
    unsigned const *A = reinterpret_cast<unsigned const *>(&a);
    unsigned const *B = reinterpret_cast<unsigned const *>(&b);
    float const *C = reinterpret_cast<float const *>(&c);
    float *D = reinterpret_cast<float  *>(&c);
    asm(
      "mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 "
      "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
      : "=f"(D[0]), "=f"(D[1]), "=f"(D[2]), "=f"(D[3])
      :
        "r"(A[0]), "r"(A[1]),
        "r"(B[0]),
        "f"(C[0]), "f"(C[1]), "f"(C[2]), "f"(C[3])
    );
    memcpy(out+threadIdx.x*2, D, 8);
    memcpy(out+8*8+threadIdx.x*2, D+2, 8);
}

I just want to calc C = A * B + C

OK I missed the D/C overlap, sorry.

Yes, D and C can refer to the same matrix. From here:

where D and C are called accumulators and may refer to the same matrix.

Thanks!

If you are using f16 for C and D you should cast to half2 instead of float.

If you are using f32 for C and D you should cast to float instead of half.
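To make that concrete, here is a minimal, untested sketch (not from the thread; the kernel name is mine, and it assumes an sm_80 or newer target) of the all-fp16 m16n8k16 variant quoted at the top of the thread. C and D live in two __half2 per thread and are passed through "r" constraints after a reinterpret to unsigned, whereas the f32-accumulator kernel above keeps them as float and uses "f" constraints.

#include <cuda_fp16.h>

__global__ void mma_fp16_acc_fp16(float *out) {
    // Per-thread fragments for one m16n8k16 tile, all-fp16 variant:
    // A = 8 halves (4 packed registers), B = 4 halves (2 packed registers),
    // C and D = 4 halves each (2 packed registers each).
    half    a[8];
    half    b[4];
    __half2 c2[2], d2[2];
    for (int i = 0; i < 8; ++i) a[i] = __float2half(1.f);
    for (int i = 0; i < 4; ++i) b[i] = __float2half(1.f);
    c2[0] = c2[1] = __float2half2_rn(0.f);   // zero the accumulator

    unsigned const *A = reinterpret_cast<unsigned const *>(&a);
    unsigned const *B = reinterpret_cast<unsigned const *>(&b);
    unsigned const *C = reinterpret_cast<unsigned const *>(&c2);
    unsigned       *D = reinterpret_cast<unsigned *>(&d2);
    asm(
      "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
      "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
      : "=r"(D[0]), "=r"(D[1])
      :
        "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
        "r"(B[0]), "r"(B[1]),
        "r"(C[0]), "r"(C[1])
    );
    // d2[0] and d2[1] now hold this thread's 4 fp16 results of D = A*B + C.
    out[threadIdx.x] = __half2float(__low2half(d2[0]));
}

Relative to the f32 kernel above, only the accumulator type and the constraint letters change; the reinterpret-cast pattern is the same.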

I still have one question. If I want to declare a and b as __half2 to reduce register usage (because I need many half values to be stored in registers), can I still write the kernel in this way? Like below:

__global__ void mma_fp16_acc_fp32(float *out) {
    float c[4] = {0., 0., 0., 0.};
    float d[4] = {0., 0., 0., 0.};
    __half2 a[32] = {...}; // any initial value
    __half2 b[16] = {...}; // any initial value
    unsigned const *A = reinterpret_cast<unsigned const *>(&a);
    unsigned const *B = reinterpret_cast<unsigned const *>(&b);
    float const *C = reinterpret_cast<float const *>(&c);
    float *D = reinterpret_cast<float  *>(&c);
    // then I declare and assign values to the integers tmp_0, tmp_1, tmp_2 depending on threadIdx
    asm(
      "mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 "
      "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
      : "=f"(D[0]), "=f"(D[1]), "=f"(D[2]), "=f"(D[3])
      :
        "r"(A[tmp_0]), "r"(A[tmp_1]),
        "r"(B[tmp_2]),
        "f"(C[0]), "f"(C[1]), "f"(C[2]), "f"(C[3])
    );
    memcpy(out+threadIdx.x*2, D, 8);
    memcpy(out+8*8+threadIdx.x*2, D+2, 8);
}

Indexing A and B with the runtime values tmp_0 / tmp_1 / tmp_2 will lead to the use of local memory instead of registers. Try to fill the initial values correctly depending on the thread, instead of giving the same data to each thread and then using indices later on.
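A minimal sketch of that suggestion (the kernel name and the fill pattern are illustrative, not from the thread): keep only this thread's fragment in fixed-index variables derived from threadIdx, so no runtime-indexed array is needed and nothing spills to local memory.

#include <cuda_fp16.h>

__global__ void mma_fp16_acc_fp32_per_thread(float *out) {
    float c[4] = {0.f, 0.f, 0.f, 0.f};   // this thread's C/D fragment
    half  a[4], b[2];                    // this thread's A/B fragment only
    int lane = threadIdx.x % 32;

    // Fill the fragment directly from the lane index (illustrative values),
    // instead of indexing a large __half2 array with runtime indices.
    #pragma unroll
    for (int i = 0; i < 4; ++i) a[i] = __int2half_rn(lane);
    b[0] = __float2half(1.f);
    b[1] = __float2half(1.f);

    unsigned const *A = reinterpret_cast<unsigned const *>(&a);
    unsigned const *B = reinterpret_cast<unsigned const *>(&b);
    float const *C = reinterpret_cast<float const *>(&c);
    float *D = reinterpret_cast<float *>(&c);
    asm(
      "mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 "
      "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
      : "=f"(D[0]), "=f"(D[1]), "=f"(D[2]), "=f"(D[3])
      :
        "r"(A[0]), "r"(A[1]),   // fixed indices stay in registers
        "r"(B[0]),
        "f"(C[0]), "f"(C[1]), "f"(C[2]), "f"(C[3])
    );
    memcpy(out + threadIdx.x * 2, D, 8);
    memcpy(out + 8 * 8 + threadIdx.x * 2, D + 2, 8);
}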
