About tensor core's flops/clk and wmma shape?

I read this blog: Programming Tensor Cores in CUDA 9 | NVIDIA Technical Blog

Each Tensor Core performs 64 floating point FMA mixed-precision operations per clock.
The FP16 multiply results in a full-precision result that is accumulated in FP32 operations with the other products in a given dot product for a 4x4x4 matrix multiply
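For context, the "64 FMA per clock" figure follows directly from the 4x4x4 shape: a matrix multiply of that shape contains exactly M*N*K = 64 multiply-accumulates. A minimal host-side counting sketch (my own illustration, not from the blog):

```cuda
// Host-side sketch (plain C++ in a .cu file): count the FMAs in one
// 4x4x4 tile C += A*B. The count is M*N*K = 64, matching the
// "64 FMA per clock" figure quoted above for one Tensor Core.
#include <cstdio>

int main() {
    const int M = 4, N = 4, K = 4;
    float A[M][K] = {}, B[K][N] = {}, C[M][N] = {};
    int fmas = 0;
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < K; ++k) {
                C[i][j] += A[i][k] * B[k][j];  // one fused multiply-add
                ++fmas;
            }
    printf("FMAs per 4x4x4 tile: %d\n", fmas);  // prints 64
    return 0;
}
```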

Q1: Is it the same for every generation of Tensor Core, i.e. do they all perform 64 FMA/clock? If not, is there a link where I can find detailed information on the FMAs per clock and the supported shapes for each Tensor Core generation?


During program execution, multiple Tensor Cores are used concurrently by a full warp of execution. The threads within a warp provide a larger 16x16x16 matrix operation to be processed by the Tensor Cores.
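In code, that warp-level 16x16x16 operation is what the `nvcuda::wmma` API exposes. A minimal single-warp sketch, assuming cc 7.0+, half-precision inputs, and 16x16 matrices stored contiguously (leading dimension 16):

```cuda
// Minimal sketch: one warp computes D = A*B + C for a single
// 16x16x16 tile via the wmma API (requires sm_70 or newer).
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_16x16x16(const half *a, const half *b, float *d) {
    // Fragments describing one 16x16x16 tile, distributed across the warp.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);    // start with C = 0
    wmma::load_matrix_sync(a_frag, a, 16);  // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // acc = A*B + acc
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}
```

All 32 threads of the warp must execute these calls together; the fragment layout is opaque and architecture-dependent.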

Q2: From the programmer's view, wmma provides a 16x16x16 matrix operation. Is that the same for all CUDA versions?

No, it's not the same. This post covers some related calculations and links. I don't have any comment on the relationship between throughput and shape, but if I were guessing, I would assume it would be the (inverse) ratio of the FLOPs per op, times the base throughput (Tensor Core FLOPs/clk).
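To make that guess concrete, here is a back-of-envelope sketch. All the specific numbers (64 FMA/clk per Tensor Core, 8 Tensor Cores serving a warp's SM partition) are illustrative assumptions for the arithmetic, not measured or documented figures:

```cuda
// Back-of-envelope sketch of the guess above: issue rate for a
// warp-level op ~= aggregate Tensor Core FMA/clk divided by the
// FMAs contained in one op of the given shape. Numbers are
// illustrative placeholders, not measured values.
#include <cstdio>

int main() {
    const long tc_fma_per_clk = 64;    // per Tensor Core (Volta blog figure)
    const long tc_count       = 8;     // Tensor Cores assumed available (assumption)
    const long m = 16, n = 16, k = 16; // warp-level wmma shape
    const long fma_per_op = m * n * k; // 4096 FMAs in one 16x16x16 op

    double ops_per_clk = double(tc_fma_per_clk * tc_count) / fma_per_op;
    printf("~%.3f wmma ops/clk, i.e. ~%.0f clocks per op\n",
           ops_per_clk, 1.0 / ops_per_clk);
    return 0;
}
```

By this logic, a smaller shape (fewer FLOPs per op) would issue at a proportionally higher op rate against the same base FLOPs/clk.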

If it were me, I wouldn't tie this to "CUDA versions", which to me means things like CUDA 10.x, CUDA 11.x, etc., i.e. software versions. I don't know of any connection between shape support and CUDA software version per se (although there is an indirect connection, because supporting newer GPU architectures requires newer minimum CUDA versions).

The connection is to GPU architecture primarily, AFAIK. So newer architectural generations may offer additional shapes compared to older ones. To get a sense of this, you could take one of the matrix-multiply instructions at the PTX level and see what shapes were "introduced" at each architectural generation. For example, take the mma instruction and scroll down to the "Target ISA Notes". We can see that in general such instructions require cc 7.0 minimum. For that particular instruction, it appears the only shape offered at cc 7.0 was 8x8x4. At cc 7.5 (Turing), a 16x8x8 shape/option was added for f16 work, to pick one example.