I'm going through the simpleTensorCoreGEMM.cu example and wonder how many SMs are used to perform the 16x16 matrix multiply-accumulate. I have an RTX 2070.
I understand that each tensor core performs a 4x4 MMA, and the RTX 2070 has 8 of these tensor cores per SM.
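For reference, the kernel in that example uses the warp-level wmma API with 16x16x16 fragments. A minimal sketch of those calls as I understand them (simplified to a single 16x16 tile with row-major layout and leading dimension 16, which is not exactly how the example lays things out) is:

```
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Minimal sketch: one warp computes a single 16x16x16 MMA tile.
// Assumes a and b are 16x16 half matrices and c is a 16x16 float matrix,
// all row-major with leading dimension 16 (simplified from the example).
__global__ void wmma_16x16x16(const half *a, const half *b, float *c) {
    // Each fragment's elements are distributed across the 32 threads of the warp.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);                  // C = 0
    wmma::load_matrix_sync(a_frag, a, 16);                // load the A tile
    wmma::load_matrix_sync(b_frag, b, 16);                // load the B tile
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);   // C += A * B, one warp-wide op
    wmma::store_matrix_sync(c, acc_frag, 16, wmma::mem_row_major);
}
```

As I read it, the whole 16x16x16 multiply-accumulate is issued cooperatively by one warp of 32 threads, which is exactly what makes me wonder how that single warp-wide operation maps onto the 8 tensor cores of an SM.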
Question 1: I have two theories (below) and wonder which one is correct:
Theory 1: One SM does all 16x16 = 256 products, using 4 iterations of CUDA threads to accommodate the 2x8 tensor-core operations required for a 16x16 (I think this is unlikely, as it would take 2 steps of tensor computing).
Theory 2: Two SMs, with 2 iterations of CUDA threads per SM to drive the 8 tensor cores, each SM computing 128 of the 16x16 = 256 products required (the most likely scenario in my view: 2 CUDA-core warp cycles, but only 1 step of tensor computing).
So… 2 SMs, with 16 tensor cores in total, for the 16x16 half-precision MMA operation? That is, 2 "CUDA core warps" per SM but only one "tensor core warp"?
Question 2: Is there such a thing as a "CUDA core warp" and a "tensor core warp"?
Thanks. Excellent product, by the way.