Can we directly use register value for tensor core calculation?

From official guide, I find tensor core can only read global and shared memory data , but my friend sent me this:

asm volatile (
    ".reg .f16x2 %Ra<4>, %Rb<2>, %Rc<2>; \n"
    " %Ra0, [%0]; \n"
    " %Ra1, [%0 + 256]; \n"
    " %Ra2, [%0 + 16]; \n"
    " %Ra3, [%0 + 272]; \n"
    " %Rb0, [%1]; \n"
    " %Rb1, [%1 + 16]; \n"
    " %Rc0, [%2]; \n"
    " %Rc1, [%2 + 128]; \n"
    "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 %Rc0, %Rc1, {%Ra0, %Ra1, %Ra2, %Ra3}, %Rb0, %Rb1, {%Rc0, %Rc1}; \n"
    " [%3], %Rc0; \n"
    " [%3 + 128], %Rc1; \n"
    : "l"(A + A_index), "l"(B + B_col_major_index), "l"(C + C_index), "l"(D + D_index)

Seems directly use register data for tensor core is OK?? Really? Where can I find such doc?

Fragment layouts are also present in the ptx documentation. 1. Introduction — parallel-thread-execution 8.2 documentation

1 Like

Thank you!!! This seems very interesting! But I can not understand it fully… Seems it is the first time NV release how tensor core does the matmul??

And … is there any description for PTX using register for mma?

Also, what is the difference between mma and wmma?

(I … did not find these in the link…)


Here are some examples that don’t use shared: 1 2 3

Those examples are using “register for mma”. Of course register data has to come from somewhere. So if you want to, you can load data into a register from global, or local, or shared, and then pass those registers directly to the PTX mma instruction, more-or-less as the examples I linked indicate.

wmma is the earliest exposure of tensorcore ops, e.g. in the v100 timeframe. As tensorcore added variety, a new instruction format (mma) was added. An example of a difference is given here

1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.