Is there a way I can get a feel for how a wmma fragment is mapped into the normal CUDA register file?
I mean, we know wmma is a warp-level primitive, and you would think that for a case like multiplying two 4x4 matrices, it would map the fragments into the normal register file, which is ordinarily allocated per thread.
It would be interesting to know: is there any intuition here that helps to understand this?
For certain tensor core ops, the PTX guide will spell it out for you. There are at least 3 varieties of TC ops: wmma, mma, and wgmma.
The wmma variant is the one exposed via CUDA C++ intrinsics, and that one (whether in CUDA C++ or PTX) has an intentionally opaque register footprint. The mma ops in PTX, on the other hand, have a register footprint that is specified; you can find it in the PTX doc.
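To make the distinction concrete, here is a minimal sketch of the CUDA C++ wmma path (my own example, assuming 16x16x16 half inputs, float accumulate, and an sm_70+ device). Notice that you only ever touch whole fragments; the per-thread register mapping inside each fragment is unspecified, which is exactly the opacity being discussed:

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a 16x16x16 product C = A*B + C.
// The fragment types hide how elements are distributed across the 32 lanes.
__global__ void wmma_16x16x16(const half *a, const half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```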
If you want to know the layout of the wmma ops, you will need to refer to unofficial sources/methods. I don't have any to refer you to, but it seems possible to do some investigation using specific patterns that produce specific results; something along the lines of the probe sketch below could be a starting point.
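Here is one possible probe, a sketch of my own (assuming sm_70+ and the 16x16x16 half shape). The matrix is filled so that each element's value encodes its (row, col) position; printing what each lane holds in its fragment registers then lets you infer the otherwise-opaque thread-to-element mapping on that particular GPU and toolkit:

```cpp
#include <cstdio>
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void probe_wmma_layout(const half *a)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::load_matrix_sync(a_frag, a, 16);

    int lane = threadIdx.x % 32;
    for (int i = 0; i < a_frag.num_elements; ++i) {
        // a[] was filled with a[r*16+c] = r*16+c, so the printed value
        // directly identifies which (row, col) this register element came from.
        printf("lane %2d elem %d holds %f\n", lane, i, __half2float(a_frag.x[i]));
    }
}

int main()
{
    half *a;
    cudaMallocManaged(&a, 16 * 16 * sizeof(half));
    for (int i = 0; i < 16 * 16; ++i) a[i] = __float2half((float)i);

    probe_wmma_layout<<<1, 32>>>(a);   // one full warp
    cudaDeviceSynchronize();
    cudaFree(a);
    return 0;
}
```

Bear in mind that any mapping you discover this way is unofficial and could in principle change with architecture or toolkit version.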
Rather than going to that trouble, if it were me and I really needed to know the register layout, I would simply switch to using a PTX mma instruction.
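For illustration, a minimal inline-PTX wrapper might look like the following (a sketch of my own, assuming sm_80+, f16 inputs, f32 accumulate; the helper name and argument packing are mine). The point is that for mma.m16n8k16 the PTX ISA guide documents exactly which lane holds which A/B/C/D elements in which register, so nothing about the layout is left to guesswork:

```cpp
#include <cuda_fp16.h>

// D = A*B + C for the m16n8k16 shape, f16 inputs, f32 accumulators.
// Per the PTX doc, each thread supplies 4 x .b32 registers of A (8 half values),
// 2 x .b32 registers of B (4 half values), and 4 x f32 for C and D, in a
// documented lane-to-element mapping.
__device__ void mma_m16n8k16_f16_f32(float d[4], const unsigned a[4],
                                     const unsigned b[2], const float c[4])
{
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```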
I’m not sure what you mean by “down to the mma instruction itself”. You may wish to read the relevant sections of the PTX doc. Or study an example. You can find examples on these forums. Here is one.