Tensor core mechanism

Do tensor cores use different computation paths for different data types, or is everything treated as u_int32? I noticed that when using mma.m16n8k16.f16.f16.f16.f16, every register pointer has to be reinterpreted as an unsigned type, even though the data is actually half/__half2. I wonder what would happen if I used an mma variant whose type does not match the data (for example, a u8 mma on fp16 data).
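For example, my wrapper looks roughly like this (a minimal sketch, assuming sm_80, the row.col layout, and fragments already loaded into registers):

```cpp
#include <cuda_fp16.h>
#include <cstdint>

// The fragments live in __half2 registers, but the inline-PTX "r" constraints
// only take 32-bit integer registers, so the pointers are reinterpreted as
// unsigned -- no bits are converted, they are only relabeled.
__device__ void mma_m16n8k16_f16(__half2 d[2],        // D: 16x8,  4 halves per thread
                                 const __half2 a[4],  // A: 16x16, 8 halves per thread
                                 const __half2 b[2],  // B: 16x8,  4 halves per thread
                                 const __half2 c[2])  // C: 16x8,  4 halves per thread
{
    uint32_t*       D = reinterpret_cast<uint32_t*>(d);
    const uint32_t* A = reinterpret_cast<const uint32_t*>(a);
    const uint32_t* B = reinterpret_cast<const uint32_t*>(b);
    const uint32_t* C = reinterpret_cast<const uint32_t*>(c);

    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
        "{%0, %1}, {%2, %3, %4, %5}, {%6, %7}, {%8, %9};\n"
        : "=r"(D[0]), "=r"(D[1])
        : "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
          "r"(B[0]), "r"(B[1]),
          "r"(C[0]), "r"(C[1]));
}
```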

I am currently looking for a way to use the tensor cores' computation to implement data shuffles between threads. It would be very helpful if I could use a lower-precision mma operation for that, since it would give more flexibility in the shuffle.

u_int32 is just how the PTX instructions are specified. The interpretation of the bits follows the operand type given in the instruction, e.g. f16, stored as described in the PTX manual, e.g. as half2.

There is no type conversion involved.
(With the exception of the MMA instruction itself, which e.g. could sum up s8 values as int32 or sum up fp16 as fp32.)

If you want to use it for distributing data, you need an operation which keeps the data bits, e.g. multiplying by 1 or 1.0, and one which skips the data, e.g. multiplying by 0 or 0.0.
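As a minimal sketch of the principle (plain C++, not the actual mma call): multiplying the data matrix by a 0/1 selector matrix copies values to new positions without changing them, because each output element is one data value times 1 plus other data values times 0. The column mapping chosen here is just a hypothetical example.

```cpp
#include <cstdio>

int main()
{
    const int M = 4, K = 4, N = 4;
    float A[M][K] = {{ 1,  2,  3,  4},
                     { 5,  6,  7,  8},
                     { 9, 10, 11, 12},
                     {13, 14, 15, 16}};

    // S[k][n] = 1 exactly when input column k should land in output column n.
    // Demo mapping: column 0 -> 2, column 1 -> 0, column 2 -> 3, column 3 -> 1.
    int dest_of[K] = {2, 0, 3, 1};
    float S[K][N] = {};
    for (int k = 0; k < K; ++k) S[k][dest_of[k]] = 1.0f;

    // D = A * S, which is exactly what mma computes (with C = 0).
    float D[M][N] = {};
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n)
            for (int k = 0; k < K; ++k)
                D[m][n] += A[m][k] * S[k][n];

    // Each row of D contains the same values as the row of A, only reordered.
    for (int m = 0; m < M; ++m)
        printf("%g %g %g %g\n", D[m][0], D[m][1], D[m][2], D[m][3]);
    return 0;
}
```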

(In theory you could distribute in a more complicated way, e.g. A + B and A - B and then restoring values. But that probably would not help much compared to distributing only A or only B.)

For reordering data, often two MMA instructions are combined in series to distribute the data separately to the right thread % 4 and then to the right thread / 4: in one case the data is put into A, in the other case into B, while the other operand holds the control values that determine where each element is sent.

So would it be faster to use the tensor cores to reorder data than to use shfl.sync? I have not seen many projects reorder data this way.

I think there was a scientific paper about using Tensor Cores for reordering.

It also depends on whether your overall bottleneck is shared memory and shuffle (they both need the same resources). Then taking some load off them for reordering can be beneficial.

I would use shfl as standard method.

Depending on the GPU generation, on some non-datacenter GPUs full use of the tensor cores could be half as fast for shuffling as full use of the shfl instruction. It could be more or less, depending on the details.

It is not very efficient to use tensor cores for reordering (considering the many multiplications with zero), but tensor cores are fast overall, and often not fully utilized.

It also depends on whether your reordering can be done with one mma instruction (only column-wise or only row-wise) instead of with two.

Also, do you profit from the data types smaller than 4 bytes (i.e. 1 or 2 bytes, or 1 bit; or 4 bits with newer GPUs), which can be gathered and scattered among the threads with tensor cores?

The B100 has special 2D Tensor Memory. I have not looked into whether that could be put to further use; see e.g. 1. Introduction — PTX ISA 8.7 documentation

For the A100, mma.m16n8k16.fp16.fp16.fp16.fp16 can swap 256 B per cycle, but shfl needs 2 cycles to do that. Also, since the tensor cores are usually not fully utilized, I think it can be a good idea to use them to reorder the data in registers. Do you have any good examples? I can't find many.

Yes, the A100 datacenter GPU has faster Tensor Cores than consumer GPUs with the same overall SM or shfl speed. (The difference gets larger with newer generations.)

No, I do not have good examples at hand and do not remember the name of the paper.

But it is kind of straight-forward. Look in the PTX manual, which threads provide which elements for the A and B matrices, and which thread receives the resulting elements. The involved threads are quite similar between the different data types and matrix sizes, probably due to hardware data paths.
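For example, the mapping for the D fragment of mma.m16n8k16 with f16 output looks roughly like this (written from memory of the PTX ISA figures, so verify against the manual before relying on it):

```cpp
// Each thread holds 4 half values d0..d3 of the 16x8 result; the helper
// returns the (row, col) of the matrix element that value i corresponds to.
struct Coord { int row, col; };

__host__ __device__ Coord d_fragment_coord(int lane, int i /* 0..3 */)
{
    int group = lane / 4;   // one of 8 thread groups, 4 threads each
    int tig   = lane % 4;   // thread index within the group
    Coord c;
    c.row = group + (i / 2) * 8;   // d0,d1 -> rows 0..7, d2,d3 -> rows 8..15
    c.col = tig * 2 + (i % 2);     // thread n in the group holds columns 2n and 2n+1
    return c;
}
```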

The mentioned mma instruction can be executed once within 2 cycles per SM on the A100, I believe, and generates 256 bytes of output. And for general shuffles you would need to call two mma instructions. So it is half as fast as shfl?

But each thread can provide several values with each mma instruction, so additional reordering can also be integrated into the mma operation.

I'm sorry, I wrote the wrong instruction. I meant mma.m16n8k8.fp16.fp16.fp16.fp16, which can reorder 8 columns of data; maybe that would be faster than shfl?

Often it is not necessary for the mma instruction to be faster than shfl. If shfl and shared memory are the bottleneck and shuffling with mma is half as fast, then you can get a 50% algorithm bandwidth speed-up (or 1/3 shorter time) by using shfl and mma at the same time for different warps, as the resources they use are separate and different warps can use them in parallel. If they have the same speed, even better.
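A minimal sketch of that idea (reorder_with_mma() here is a hypothetical stand-in for your mma-based shuffle, not an existing function):

```cpp
// Split the warps between the two reordering paths so the shuffle unit and the
// tensor cores are busy at the same time.
__global__ void reorder_split(float* data)
{
    unsigned lane = threadIdx.x % 32;
    unsigned warp = threadIdx.x / 32;

    float v = data[blockIdx.x * blockDim.x + threadIdx.x];

    if (warp % 2 == 0) {
        // even warps: classic shuffle-based reordering (example: reverse the lanes)
        v = __shfl_sync(0xffffffffu, v, 31 - lane);
    } else {
        // odd warps: tensor-core based reordering would go here
        // v = reorder_with_mma(v);
    }

    data[blockIdx.x * blockDim.x + threadIdx.x] = v;
}
```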

This instruction takes 4 cycles per SM partition, which amounts to once per cycle per SM.

It generates 128 half values or 256 bytes of output.

shfl is executed once per cycle per SM and generates 128 bytes of output.

If you need 2 mma instructions for an arbitrary shuffle, they have the same speed (but in that case the mma has more flexibility in how the reshuffling is done).

If you use any shared memory, there are fewer resources left for the shuffle.

Thanks! I get it!