About the relationship between warp and tensor_core

Typically, if e.g. an FMUL, FADD, or FFMA instruction is issued warp-wide, then 32 such calculations are needed to satisfy the warp, one per thread. Since each “CUDA core” can perform 1 FMA per cycle, handling a single warp-wide FFMA instruction requires 32 of them. If a particular SMSP has 32 available, the instruction can be scheduled across all 32 in a single clock cycle. If the SMSP instead has only 16, the instruction will take 2 clock cycles, using those 16 “CUDA cores” on each cycle, to cover the whole warp.
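The arithmetic above can be sketched as a small calculation. This is only an illustrative model, not a description of the actual scheduler: the function name `issue_cycles` and the constant `WARP_SIZE` are made up for this example, and it assumes the only limit is the number of FP32 lanes in the SMSP.

```python
WARP_SIZE = 32  # threads per warp, so 32 results are needed per warp-wide instruction


def issue_cycles(lanes_per_smsp: int, warp_size: int = WARP_SIZE) -> int:
    """Cycles needed to cover one warp-wide instruction (hypothetical helper).

    Assumes each lane ("CUDA core") produces one result per cycle and that
    nothing else limits issue, which is a simplification.
    """
    if lanes_per_smsp <= 0:
        raise ValueError("lanes_per_smsp must be positive")
    # Ceiling division: a partially filled last cycle still costs a full cycle.
    return (warp_size + lanes_per_smsp - 1) // lanes_per_smsp


if __name__ == "__main__":
    for lanes in (32, 16):
        print(f"{lanes} lanes -> {issue_cycles(lanes)} cycle(s) per warp-wide FFMA")
```

With 32 lanes the model gives 1 cycle, and with 16 lanes it gives 2 cycles, matching the two cases described above.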