Tensor core mechanism

Do tensor cores use different computation paths for different data types, or is everything treated as u_int32? I noticed that when using mma.m16n8k16.f16.f16.f16.f16, every register pointer has to be reinterpreted as an unsigned type, even though the data is actually half/__half2. I wonder what would happen if I used an mma variant whose type does not match the data (for example, a u8 mma on fp16 data).
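For example, my wrapper looks roughly like this (a minimal sketch, assuming sm_80, the row.col layout, and fragments already loaded into registers):

```cpp
#include <cuda_fp16.h>
#include <cstdint>

// The fragments live in __half2 registers, but the inline-PTX "r" constraints
// only take 32-bit integer registers, so the pointers are reinterpreted as
// unsigned -- no bits are converted, they are only relabeled.
__device__ void mma_m16n8k16_f16(__half2 d[2],        // D: 16x8,  4 halves per thread
                                 const __half2 a[4],  // A: 16x16, 8 halves per thread
                                 const __half2 b[2],  // B: 16x8,  4 halves per thread
                                 const __half2 c[2])  // C: 16x8,  4 halves per thread
{
    uint32_t*       D = reinterpret_cast<uint32_t*>(d);
    const uint32_t* A = reinterpret_cast<const uint32_t*>(a);
    const uint32_t* B = reinterpret_cast<const uint32_t*>(b);
    const uint32_t* C = reinterpret_cast<const uint32_t*>(c);

    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
        "{%0, %1}, {%2, %3, %4, %5}, {%6, %7}, {%8, %9};\n"
        : "=r"(D[0]), "=r"(D[1])
        : "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
          "r"(B[0]), "r"(B[1]),
          "r"(C[0]), "r"(C[1]));
}
```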

I am currently looking for a way to use the tensor cores' computation to implement data shuffles between threads. It would be very helpful if I could use a lower-precision mma operation for that, since it would give more flexibility in the shuffle.

u_int32 is just how the PTX instructions are specified. The interpretation of the bits follows the operand type given in the instruction, e.g. f16, stored as described in the PTX manual, e.g. as half2.

There is no type conversion involved.
(With the exception of the MMA instruction itself, which e.g. could sum up s8 values as int32 or sum up fp16 as fp32.)

If you want to use it for distributing data, you need an operation which keeps the data bits, e.g. multiplying by 1 or 1.0, and one which skips the data, e.g. multiplying by 0 or 0.0.
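As a minimal sketch of the principle (plain C++, not the actual mma call): multiplying the data matrix by a 0/1 selector matrix copies values to new positions without changing them, because each output element is one data value times 1 plus other data values times 0. The column mapping chosen here is just a hypothetical example.

```cpp
#include <cstdio>

int main()
{
    const int M = 4, K = 4, N = 4;
    float A[M][K] = {{ 1,  2,  3,  4},
                     { 5,  6,  7,  8},
                     { 9, 10, 11, 12},
                     {13, 14, 15, 16}};

    // S[k][n] = 1 exactly when input column k should land in output column n.
    // Demo mapping: column 0 -> 2, column 1 -> 0, column 2 -> 3, column 3 -> 1.
    int dest_of[K] = {2, 0, 3, 1};
    float S[K][N] = {};
    for (int k = 0; k < K; ++k) S[k][dest_of[k]] = 1.0f;

    // D = A * S, which is exactly what mma computes (with C = 0).
    float D[M][N] = {};
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n)
            for (int k = 0; k < K; ++k)
                D[m][n] += A[m][k] * S[k][n];

    // Each row of D contains the same values as the row of A, only reordered.
    for (int m = 0; m < M; ++m)
        printf("%g %g %g %g\n", D[m][0], D[m][1], D[m][2], D[m][3]);
    return 0;
}
```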

(In theory you could distribute in a more complicated way, e.g. A + B and A - B and then restoring values. But that probably would not help much compared to distributing only A or only B.)

For reordering data, often two MMA instructions are combined in series to distribute the data separately to the right thread % 4 and then to the right thread / 4: in one case the data is put into A, in the other case into B, while the other operand holds the control values that determine where each element is sent.

So would it be faster to use the tensor cores to reorder data than to use shfl.sync? I have not seen many projects reorder data this way.

I think there was a scientific paper about using Tensor Cores for reordering.

It also depends on whether your overall bottleneck is shared memory and shuffle (they both need the same resources). Then taking some load off them for reordering can be beneficial.

I would use shfl as standard method.

Depending on the GPU generation, on some non-datacenter GPUs full use of the tensor cores could be half as fast for shuffling as full use of the shfl instruction. It could be more or less, depending on the details.

It is not very efficient to use tensor cores for reordering (considering the many multiplications with zero), but tensor cores are fast overall, and often not fully utilized.

It also depends on whether your reordering can be done with one mma instruction (only column-wise or only row-wise) instead of with two.

Also, do you profit from the data types smaller than 4 bytes (i.e. 1 or 2 bytes, or 1 bit; or 4 bits with newer GPUs), which can be gathered and scattered among the threads with tensor cores?

The B100 has special 2D Tensor Memory. I have not looked into whether that could be put to further use; see e.g. 1. Introduction — PTX ISA 8.7 documentation

For the A100, mma.m16n8k16.fp16.fp16.fp16.fp16 can swap 256 B per cycle, but shfl needs 2 cycles to do that. Also, since the tensor cores are usually not fully utilized, I think it can be a good idea to use them to reorder the data in registers. Do you have any good examples? I can't find many.

Yes, the A100 datacenter GPU has faster Tensor Cores than consumer GPUs with the same overall SM or shfl speed. (The difference gets larger with newer generations.)

No, I do not have good examples at hand and do not remember the name of the paper.

But it is kind of straight-forward. Look in the PTX manual, which threads provide which elements for the A and B matrices, and which thread receives the resulting elements. The involved threads are quite similar between the different data types and matrix sizes, probably due to hardware data paths.
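For example, the mapping for the D fragment of mma.m16n8k16 with f16 output looks roughly like this (written from memory of the PTX ISA figures, so verify against the manual before relying on it):

```cpp
// Each thread holds 4 half values d0..d3 of the 16x8 result; the helper
// returns the (row, col) of the matrix element that value i corresponds to.
struct Coord { int row, col; };

__host__ __device__ Coord d_fragment_coord(int lane, int i /* 0..3 */)
{
    int group = lane / 4;   // one of 8 thread groups, 4 threads each
    int tig   = lane % 4;   // thread index within the group
    Coord c;
    c.row = group + (i / 2) * 8;   // d0,d1 -> rows 0..7, d2,d3 -> rows 8..15
    c.col = tig * 2 + (i % 2);     // thread n in the group holds columns 2n and 2n+1
    return c;
}
```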

The mentioned mma instruction can be executed once within 2 cycles per SM on the A100, I believe, and generates 256 bytes of output. And for general shuffles you would need to call two mma instructions. So it is half as fast as shfl?

But each thread can provide several values with each mma instruction, so additional reordering can also be integrated into the mma operation.

I'm sorry, I wrote the wrong instruction. I meant mma.m16n8k8.fp16.fp16.fp16.fp16, which can reorder 8 columns of data; maybe that would be faster than shfl?

Often it is not necessary for the mma instruction to be faster than shfl. If shfl and shared memory are the bottleneck and shuffling with mma is half as fast, then you can get a 50% algorithm bandwidth speed-up (or 1/3 shorter time) by using shfl and mma at the same time for different warps, as the resources they use are separate and different warps can use them in parallel. If they have the same speed, even better.
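A minimal sketch of that idea (reorder_with_mma() here is a hypothetical stand-in for your mma-based shuffle, not an existing function):

```cpp
// Split the warps between the two reordering paths so the shuffle unit and the
// tensor cores are busy at the same time.
__global__ void reorder_split(float* data)
{
    unsigned lane = threadIdx.x % 32;
    unsigned warp = threadIdx.x / 32;

    float v = data[blockIdx.x * blockDim.x + threadIdx.x];

    if (warp % 2 == 0) {
        // even warps: classic shuffle-based reordering (example: reverse the lanes)
        v = __shfl_sync(0xffffffffu, v, 31 - lane);
    } else {
        // odd warps: tensor-core based reordering would go here
        // v = reorder_with_mma(v);
    }

    data[blockIdx.x * blockDim.x + threadIdx.x] = v;
}
```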

This instruction takes 4 cycles per SM partition, which amounts to once per cycle per SM.

It generates 128 half values or 256 bytes of output.

shfl is executed once per cycle per SM and generates 128 bytes of output.

If you need 2 mma instructions for an arbitrary shuffle, they have the same speed (but in that case the mma has more flexibility in how the reshuffling is done).

If you use any shared memory, there are fewer resources left for the shuffle.

Thanks! I get it!