About the relationship between warp and tensor_core

Robert_Crovella · July 6, 2023, 2:45pm

Typically, if an instruction such as FMUL, FADD, or FFMA is issued warp-wide, we need 32 such calculations to satisfy the needs of the warp. Since each “CUDA core” can perform one FMA per cycle, handling a single warp-wide FFMA instruction requires 32 of them. If a particular SMSP (SM sub-partition) has 32 available, the instruction can be scheduled on all 32 in a single clock cycle. If the SMSP has only 16, the instruction takes 2 clock cycles, using those 16 “CUDA cores” in each cycle, to meet the needs of that FFMA instruction warp-wide.
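The arithmetic above can be sketched in a few lines. This is only a simplified model, assuming each lane completes one operation per cycle and ignoring latency, pipelining, and scheduling effects; the function name `issue_cycles` and its parameters are hypothetical and not part of any NVIDIA API.

```python
WARP_SIZE = 32  # threads per warp, as in the post above


def issue_cycles(lanes_per_smsp: int, warp_size: int = WARP_SIZE) -> int:
    """Cycles needed to issue one warp-wide instruction.

    Simplified model (an assumption): every lane ("CUDA core") handles
    one thread's operation per cycle, so the count is
    ceil(warp_size / lanes_per_smsp).
    """
    if lanes_per_smsp <= 0:
        raise ValueError("lanes_per_smsp must be positive")
    # Integer ceiling division without floating point.
    return -(-warp_size // lanes_per_smsp)


if __name__ == "__main__":
    for lanes in (32, 16):
        print(f"{lanes} lanes per SMSP: {issue_cycles(lanes)} cycle(s) per warp-wide FFMA")
```

With 32 lanes the model gives 1 cycle and with 16 lanes it gives 2 cycles, matching the two cases described in the post.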
