Is there a way I can get a feel for how a wmma fragment is mapped into the normal CUDA register file?
I mean, we know wmma is a warp-level primitive, and you would think that for a case like multiplying two 4x4 matrices, it would map the fragments into the normal register file, which is ordinarily allocated per thread.
It would be interesting to know: is there any intuition here that helps to understand this?
For certain tensor core ops, the PTX guide will spell it out for you. There are at least 3 varieties of TC ops: wmma, mma, and wgmma.
The wmma variant is the one exposed via CUDA C++ intrinsics, and that one (whether in CUDA C++ or PTX) has an intentionally opaque register footprint. The mma ops in PTX, on the other hand, have a register footprint that is specified; you can find it in the PTX doc.
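To make the distinction concrete, here is a minimal sketch of the CUDA C++ wmma path (my own example, assuming 16x16x16 half inputs, float accumulate, and an sm_70+ device). Notice that you only ever touch whole fragments; the per-thread register mapping inside each fragment is unspecified, which is exactly the opacity being discussed:

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a 16x16x16 product C = A*B + C.
// The fragment types hide how elements are distributed across the 32 lanes.
__global__ void wmma_16x16x16(const half *a, const half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```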
If you want to know the layout of the wmma ops, you will need to refer to unofficial sources/methods. I don't have any to refer you to, but it seems possible to do some investigation using specific patterns that produce specific results; something along the lines of the probe sketch below could be a starting point.
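Here is one possible probe, a sketch of my own (assuming sm_70+ and the 16x16x16 half shape). The matrix is filled so that each element's value encodes its (row, col) position; printing what each lane holds in its fragment registers then lets you infer the otherwise-opaque thread-to-element mapping on that particular GPU and toolkit:

```cpp
#include <cstdio>
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void probe_wmma_layout(const half *a)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::load_matrix_sync(a_frag, a, 16);

    int lane = threadIdx.x % 32;
    for (int i = 0; i < a_frag.num_elements; ++i) {
        // a[] was filled with a[r*16+c] = r*16+c, so the printed value
        // directly identifies which (row, col) this register element came from.
        printf("lane %2d elem %d holds %f\n", lane, i, __half2float(a_frag.x[i]));
    }
}

int main()
{
    half *a;
    cudaMallocManaged(&a, 16 * 16 * sizeof(half));
    for (int i = 0; i < 16 * 16; ++i) a[i] = __float2half((float)i);

    probe_wmma_layout<<<1, 32>>>(a);   // one full warp
    cudaDeviceSynchronize();
    cudaFree(a);
    return 0;
}
```

Bear in mind that any mapping you discover this way is unofficial and could in principle change with architecture or toolkit version.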
Rather than going to that trouble, if it were me and I really needed to know the register layout, I would simply switch to using a PTX mma instruction.
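For illustration, a minimal inline-PTX wrapper might look like the following (a sketch of my own, assuming sm_80+, f16 inputs, f32 accumulate; the helper name and argument packing are mine). The point is that for mma.m16n8k16 the PTX ISA guide documents exactly which lane holds which A/B/C/D elements in which register, so nothing about the layout is left to guesswork:

```cpp
#include <cuda_fp16.h>

// D = A*B + C for the m16n8k16 shape, f16 inputs, f32 accumulators.
// Per the PTX doc, each thread supplies 4 x .b32 registers of A (8 half values),
// 2 x .b32 registers of B (4 half values), and 4 x f32 for C and D, in a
// documented lane-to-element mapping.
__device__ void mma_m16n8k16_f16_f32(float d[4], const unsigned a[4],
                                     const unsigned b[2], const float c[4])
{
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```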
I’m not sure what you mean by “down to the mma instruction itself”. You may wish to read the relevant sections of the PTX doc. Or study an example. You can find examples on these forums. Here is one.