From the official guide, I understand that tensor cores can only read data from global and shared memory (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-load-instruction-wmma-load ), but my friend sent me this:
// Warp-level m16n8k16 f16 MMA: loads the A/B/C fragments from global memory
// into PTX virtual registers, runs mma.sync on those registers, and stores
// the D fragment back to global memory.
// In:  %0 = &A[A_index], %1 = &B[B_col_major_index], %2 = &C[C_index]
// Out: %3 = &D[D_index]
// NOTE(review): the byte offsets (+256, +16, +272, +128) presumably encode
// the matrices' leading dimensions in bytes — confirm against lda/ldb/ldc.
asm volatile (
    // Brace-scope the .reg declarations so this asm can be instantiated more
    // than once (e.g. after loop unrolling) without duplicate-definition errors.
    "{\n"
    // Literal '%' in PTX register names must be escaped as '%%' inside an
    // inline-asm template; bare %R would collide with operand substitution.
    "  .reg .f16x2 %%Ra<4>, %%Rb<2>, %%Rc<2>;\n"
    // A fragment: 4 x b32 (= 8 f16) per thread.
    "  ld.global.b32 %%Ra0, [%0];\n"
    "  ld.global.b32 %%Ra1, [%0 + 256];\n"
    "  ld.global.b32 %%Ra2, [%0 + 16];\n"
    "  ld.global.b32 %%Ra3, [%0 + 272];\n"
    // B fragment: 2 x b32 per thread.
    "  ld.global.b32 %%Rb0, [%1];\n"
    "  ld.global.b32 %%Rb1, [%1 + 16];\n"
    // C (accumulator) fragment: 2 x b32 per thread.
    "  ld.global.b32 %%Rc0, [%2];\n"
    "  ld.global.b32 %%Rc1, [%2 + 128];\n"
    // mma.sync consumes register operands only; PTX requires each matrix
    // operand (d, a, b, c) to be a brace-enclosed vector expression.
    "  mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
    "{%%Rc0, %%Rc1}, {%%Ra0, %%Ra1, %%Ra2, %%Ra3}, {%%Rb0, %%Rb1}, {%%Rc0, %%Rc1};\n"
    // D fragment written back to global memory.
    "  st.global.b32 [%3], %%Rc0;\n"
    "  st.global.b32 [%3 + 128], %%Rc1;\n"
    "}\n"
    : // no register outputs
    : "l"(A + A_index), "l"(B + B_col_major_index),
      "l"(C + C_index), "l"(D + D_index)
    : "memory" // asm reads and writes global memory behind the compiler's back
);
So it seems that feeding register data directly to the tensor cores is OK?? Really? Where can I find documentation for this?