About the relationship between warp and tensor_core

Shaquille · July 6, 2023, 2:52am

Hello, NVIDIA experts.
I have a question about the A100 Tensor Cores as described in this document: <nvidia-ampere-architecture-whitepaper.pdf>. The relevant passage is:

Each of the A100 Tensor Cores can execute 256 FP16 FMA operations
per clock, allowing it to compute the results for an 8x4x8 mixed-precision matrix multiplication
per clock. Each SM in the A100 GPU includes four of the new redesigned Tensor Cores and
therefore each SM in A100 delivers 1024 FP16 FMA operations per clock (or 2048 individual
FP16 floating point operations per clock)
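
The arithmetic in the quoted passage can be checked in a few lines. This is only a sketch that restates the whitepaper's numbers; the variable names are illustrative and not taken from any NVIDIA API.

```python
# Shape of the per-clock matrix multiply quoted from the whitepaper: 8x4x8.
dims = (8, 4, 8)

# A matrix multiply with these three dimensions needs their product in FMAs.
fma_per_tensor_core_per_clock = dims[0] * dims[1] * dims[2]

tensor_cores_per_sm = 4  # figure quoted from the whitepaper
fma_per_sm_per_clock = fma_per_tensor_core_per_clock * tensor_cores_per_sm

# One FMA counts as two floating point operations (a multiply and an add).
flops_per_sm_per_clock = 2 * fma_per_sm_per_clock

print(fma_per_tensor_core_per_clock)  # 256
print(fma_per_sm_per_clock)           # 1024
print(flops_per_sm_per_clock)         # 2048
```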

It says each A100 SM has 4 Tensor Cores.
Each A100 SM has 64 CUDA cores, so I assume it can run 2 warps at a time. So I'm confused: why are there 4 Tensor Cores?
The "mma" operation is organized at warp level when programming in CUDA C.
So how can 4 Tensor Cores match 2 warps in each A100 SM?

Robert_Crovella · July 6, 2023, 3:02am

A Tensor Core is a functional unit, just like CUDA cores are functional units. There isn't any particular connection between the two, just as there is no particular connection between a load/store unit and a special function unit.

The number of warps is not tied to the number of functional units of any particular type. An A100 SM can have many warps in flight, that is, resident and available for the warp scheduler to select.

Finally, whereas you need 32 CUDA cores to support a single instruction of the type FADD, FMUL, or FFMA warp-wide in one cycle, you only need a single TC unit to support a Tensor Core instruction warp-wide.
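
As a rough illustration of the last point: a warp-level matrix instruction describes a whole tile of FMAs, and a single Tensor Core can work through that tile over several clocks. The sketch below assumes the m16n8k16 FP16 shape of the PTX `mma.sync` instruction and the whitepaper figure of 256 FP16 FMA per clock per Tensor Core. The helper name `ideal_clocks_for_mma` is hypothetical, and the result is an idealized lower bound that ignores pipeline latency and operand delivery, not a measured timing.

```python
def ideal_clocks_for_mma(m, n, k, fma_per_clock=256):
    """Idealized clocks one Tensor Core needs for an m x n x k warp-level MMA.

    Assumes the unit sustains `fma_per_clock` FMAs every clock and ignores
    pipeline latency, so this is a lower bound rather than a real timing.
    """
    total_fma = m * n * k
    # Ceiling division: a partially filled clock still costs a full clock.
    return -(-total_fma // fma_per_clock)


# One warp-wide m16n8k16 instruction: 2048 FMAs on one 256-FMA/clock unit.
print(ideal_clocks_for_mma(16, 8, 16))  # 8
```

Under these assumptions one warp keeps one Tensor Core busy for several clocks with a single instruction, so nothing requires as many warps as there are lanes or units.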

Shaquille · July 6, 2023, 3:06am

thank you

Shaquille · July 6, 2023, 4:25am

Hello, Robert.
I still have a question.
If there are just 2 warps on each SM, how can CUDA issue 4 Tensor Core instructions in parallel?
In more detail:

Each SM has only 64 CUDA cores (2 warps), so it can prepare only 2 sets of data for the Tensor Cores in parallel. Is that right?
I think 4 Tensor Cores need at least 4 sets of data if they are to work in parallel. Is that right?
I don't know how to reconcile these two points.
Could you explain how the 4 Tensor Cores can work in parallel?

Greg · July 6, 2023, 5:41am

Each GA100 SM has 4 sub-partitions. Each sub-partition has a warp scheduler, register file, 16 lanes of fma pipe (CUDA cores), 16 lanes of alu pipe, 16 warp ids, …
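
The per-SM totals that follow from these sub-partition figures can be sketched as below. The constant names are illustrative. The even split of the whitepaper's 4 Tensor Cores across the 4 sub-partitions is an assumption made here for the arithmetic, and each warp id is taken to mean one resident warp slot.

```python
SUB_PARTITIONS_PER_SM = 4
FMA_LANES_PER_SUB_PARTITION = 16   # "CUDA cores" in each sub-partition
WARP_IDS_PER_SUB_PARTITION = 16    # resident warp slots in each sub-partition
TENSOR_CORES_PER_SM = 4            # figure from the whitepaper

cuda_cores_per_sm = SUB_PARTITIONS_PER_SM * FMA_LANES_PER_SUB_PARTITION
resident_warps_per_sm = SUB_PARTITIONS_PER_SM * WARP_IDS_PER_SUB_PARTITION

# Assumption: the Tensor Cores are spread evenly over the sub-partitions.
tensor_cores_per_sub_partition = TENSOR_CORES_PER_SM // SUB_PARTITIONS_PER_SM

print(cuda_cores_per_sm)               # 64
print(resident_warps_per_sm)           # 64
print(tensor_cores_per_sub_partition)  # 1
```

So the 64 CUDA cores do not limit the SM to 2 warps: up to 64 warps can be resident, and each sub-partition's scheduler picks among its own 16.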

A warp is an entity consisting of state, including registers, a program counter, an active mask, and per-lane thread state. Warps are scheduled and dispatched to pipelined execution units such as the fma pipe (CUDA cores), alu, sfu/xu, imma/hmma (tensor cores), fp64, adu, lsu, etc.

Execution pipes are not all 32 lanes wide. For example the fma pipe (CUDA cores) is 16 lanes wide so a warp is dispatched over 2 cycles.

The warp scheduler can select a different warp each cycle in order to hide dependent latency and pipeline issue latency.
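
A toy model can show why switching between warps hides latency. It is only a sketch: one scheduler with one issue slot per cycle, every instruction in a warp depends on the previous one, and the fixed latency of 4 cycles is a made-up number, not an A100 figure. The function name `cycles_to_finish` is hypothetical.

```python
def cycles_to_finish(num_warps, instrs_per_warp, latency):
    """Cycle at which the last result is ready in a toy single-issue model.

    Each warp runs a chain of dependent instructions, so a warp cannot
    issue again until `latency` cycles after its previous issue. Every
    cycle the scheduler issues from the first warp that is ready.
    """
    ready_at = [0] * num_warps            # earliest cycle each warp may issue
    remaining = [instrs_per_warp] * num_warps
    cycle = 0
    last_done = 0
    while any(remaining):
        for w in range(num_warps):
            if remaining[w] and ready_at[w] <= cycle:
                remaining[w] -= 1
                ready_at[w] = cycle + latency
                last_done = max(last_done, cycle + latency)
                break                      # only one issue per cycle
        cycle += 1
    return last_done


print(cycles_to_finish(1, 4, 4))  # 16: one warp, the issue slot mostly idles
print(cycles_to_finish(4, 4, 4))  # 19: four times the work in about the same time
```

With one warp the scheduler waits out every dependency. With four warps it finds a ready warp on every cycle, so 16 instructions finish in 19 cycles instead of 4 instructions in 16.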

Shaquille · July 6, 2023, 7:55am

Hello, Greg:

thanks for your help.

"For example the fma pipe (CUDA cores) is 16 lanes wide so a warp is dispatched over 2 cycles."

We usually assume that 1 CUDA core can execute 1 FMA per cycle, which seems to conflict with what you said.
How should I understand your statement?

Robert_Crovella · July 6, 2023, 2:45pm

Typically, if an instruction such as FMUL, FADD, or FFMA is issued warp-wide, we need 32 such calculations to satisfy the warp. Since each "CUDA core" can perform 1 FMA per cycle, handling a single warp-wide FFMA instruction in one cycle would take 32 of them. If a particular SMSP has 32 available, the instruction can be scheduled on all 32 in a single clock cycle. If the SMSP has only 16, the instruction takes 2 clock cycles on those 16 "CUDA cores" to cover the whole warp.
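
The cycle count described here is a ceiling division of the warp size by the pipe width. A minimal sketch, with the illustrative helper name `dispatch_cycles` and the 4 sub-partitions per SM taken from earlier in the thread:

```python
WARP_SIZE = 32


def dispatch_cycles(lanes, warp_size=WARP_SIZE):
    """Cycles needed to push one warp-wide instruction through a pipe
    that is `lanes` wide, at one operation per lane per cycle."""
    return -(-warp_size // lanes)  # ceiling division


print(dispatch_cycles(32))  # 1
print(dispatch_cycles(16))  # 2

# SM-wide FFMA rate with 4 sub-partitions, each with a 16-lane fma pipe.
sub_partitions = 4
ffma_warp_instructions_per_clock = sub_partitions / dispatch_cycles(16)
print(ffma_warp_instructions_per_clock)  # 2.0
```

The "2 warps" figure from the original question is therefore a throughput (two warp-wide FFMA instructions per clock per SM), not a limit on how many warps the SM can hold.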

Shaquille · July 7, 2023, 3:50am

Thanks for your help.
