I'm going through the simpleTensorCoreGEMM.cu example, and I wonder how many SMs are used to perform the 16x16 matrix multiply-accumulate. I have an RTX 2070.

I understand that each Tensor Core performs a 4x4 MMA, and the RTX 2070 has 8 of these Tensor Cores per SM.

Question 1: I have two theories (below); which one is correct?

Theory 1: One SM does all 16x16 = 256 products, using 4 iterations of CUDA threads to cover the 16x16 tile with its 8 Tensor Cores. (I think this is unlikely, as it would take 2 steps of Tensor computing.)

Theory 2: Two SMs, with 2 iterations of CUDA threads per SM to feed the 8 Tensor Cores, each SM computing 128 of the 16x16 = 256 products required. (Most likely scenario: 2 CUDA-core warp cycles, but only 1 step of Tensor computing.)

So... 2 SMs, with 16 Tensor Cores total, for the 16x16 half-precision MMA operation? Two "CUDA core warps" per SM but only one "Tensor Core warp"?
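
For context, here is a minimal sketch of the kind of 16x16x16 WMMA operation I'm asking about, written against the `nvcuda::wmma` API from `mma.h` (this is my own stripped-down version, not the exact code from simpleTensorCoreGEMM.cu; it assumes a Turing-capable build, e.g. `nvcc -arch=sm_75`):

```cuda
#include <mma.h>
using namespace nvcuda;

// A single warp (32 threads) cooperatively performs one
// 16x16x16 matrix multiply-accumulate: C = A * B + C.
__global__ void wmma_16x16x16(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);          // start accumulator at zero
    wmma::load_matrix_sync(a_frag, a, 16);      // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

// Launched with exactly one warp, e.g.:
//   wmma_16x16x16<<<1, 32>>>(d_a, d_b, d_c);
```

At the CUDA level the whole 16x16x16 tile is handed to one warp; how the hardware maps that warp's `mma_sync` onto Tensor Cores and SMs is what my question is about.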

Question 2: Is there such a thing as a "CUDA core warp" and a "Tensor Core warp"?

Thanks; excellent product, by the way.