Hello, NVIDIA experts.
I have some questions about the A100's Tensor Cores as described in this document: <nvidia-ampere-architecture-whitepaper.pdf>:

Each of the A100 Tensor Cores can execute 256 FP16 FMA operations
per clock, allowing it to compute the results for an 8x4x8 mixed-precision matrix multiplication
per clock. Each SM in the A100 GPU includes four of the new redesigned Tensor Cores and
therefore each SM in A100 delivers 1024 FP16 FMA operations per clock (or 2048 individual
FP16 floating point operations per clock)

It says each SM of the A100 has 4 Tensor Cores.
Each SM of the A100 has 64 CUDA cores, so that is only 2 warps' worth of CUDA cores. I'm very confused: why are there 4 Tensor Cores?
The "mma" instruction is organized at warp level if we want to program it in CUDA C.
So why do 4 Tensor Cores match only 2 warps' worth of CUDA cores in each SM of the A100?

A Tensor Core is a functional unit, just like a CUDA core is a functional unit. There isn't any particular connection between the two, just as there is no particular connection between a load/store unit and a special function unit.

The number of warps is not connected to the number of functional units of any particular type. So an A100 SM can have many warps in flight or selectable for scheduling (up to 64 resident warps per SM).
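To illustrate, the resident-warp limit can be queried from the device properties; a minimal sketch, assuming a single-GPU system using device 0. The limit comes from the SM thread limit, not from any functional-unit count:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;

    // Resident warps per SM = max resident threads / warp size.
    int warps_in_flight = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    // On A100 (compute capability 8.0) this reports 64 warps per SM,
    // far more than the 2 warps' worth of FP32 CUDA cores (64) per SM.
    printf("max resident warps per SM: %d\n", warps_in_flight);
    return 0;
}
```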

Finally, whereas you need 32 CUDA cores to support a single instruction of the type FADD, FMUL, or FFMA warp-wide, you only need a single Tensor Core unit to support a tensor core instruction warp-wide.
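For reference, this is what a single warp-wide tensor core instruction looks like at the CUDA C++ level; a minimal sketch using the `nvcuda::wmma` API with an assumed m16n16k16 fragment shape (requires compute capability 7.0 or higher):

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp (32 threads) cooperatively executes one warp-wide mma.
// Launch as: warp_mma<<<1, 32>>>(a, b, c);
__global__ void warp_mma(const half *a, const half *b, float *c) {
    // The fragments are distributed across all 32 threads of the warp.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::fill_fragment(fc, 0.0f);
    wmma::load_matrix_sync(fa, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(fc, fa, fb, fc);      // one warp-wide HMMA operation
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}
```

The key point: the whole warp issues the `mma_sync` together, and the tensor core hardware services it; no per-thread mapping onto 32 CUDA cores is involved.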

Hello, Robert.
I still have some questions.
If there are just 2 warps' worth of CUDA cores on each SM, how can CUDA issue 4 Tensor Core instructions in parallel?
for further detail:

Each SM has just 64 CUDA cores (2 warps' worth), so it can only prepare 2 sets of data for the Tensor Cores in parallel. Is that right?

I think 4 Tensor Cores need at least 4 sets of data if they are to work in parallel. Is that right?
I don't know how to explain the above.
Could you teach me how the 4 Tensor Cores can work in parallel?

Each GA100 SM has 4 sub-partitions. Each sub-partition has a warp scheduler, register file, 16 lanes of fma pipe (CUDA cores), 16 lanes of alu pipe, 16 warp ids, …

A warp is an entity consisting of state including registers, a program counter, an active mask, and per-lane thread state. Warps are scheduled and dispatched to pipelined execution units such as the fma pipe (CUDA cores), alu, sfu/xu, imma/hmma (tensor cores), fp64, adu, lsu, etc.

Execution pipes are not all 32 lanes wide. For example, the fma pipe (CUDA cores) is 16 lanes wide, so a warp is dispatched over 2 cycles.

The warp scheduler can select a different warp each cycle in order to hide dependent latency and pipeline issue latency.

Typically, if e.g. an FMUL, FADD, or FFMA instruction is issued warp-wide, then we need 32 such calculations to satisfy the needs of the warp. Since each "CUDA core" can perform 1 FMA per cycle, handling a single FFMA instruction warp-wide would require 32 of them. If there are 32 available in a particular SMSP (SM sub-partition), the instruction could be scheduled across all 32 in a single clock cycle. If the SMSP instead has only 16, then it will take 2 clock cycles, using those 16 "CUDA cores" over 2 cycles, to meet the needs of that FFMA instruction warp-wide.