Turing 16x16 MMA, SM usage, 1 or 2?

I am going through the simpleTensorCoreGEMM.cu example and wonder how many SMs are used to perform the 16x16 matrix multiply-accumulate. I have an RTX 2070.

I understand that each tensor core performs a 4x4 MMA, and the RTX 2070 has 8 of these tensor cores per SM.

Question 1: I have 2 theories (below) and wonder which one is correct:

Theory 1: One SM does all of the 16x16 = 256 products, with 4 iterations of CUDA threads to accommodate the 2x8 tensor cores required for the 16x16 (I think this is unlikely), as it would take 2 steps of tensor computing.

Theory 2: Two SMs, with 2 iterations of CUDA threads per SM to accommodate the 8 tensor cores, computing 128 products per SM of the 16x16 = 256 required
(the most likely scenario, in 2 CUDA core warp cycles), as it would take 1 step of tensor computing.

So… 2 SMs, with 16 tensor cores for the 16x16 half-precision floating point MMA operation? 2 “CUDA core warps” per SM but only one “tensor core warp”.

Question 2: Is there such a thing as a “CUDA core warp” and a “tensor core warp”?

Thanks, excellent product by the way.

At the lowest level, tensor core activity is driven from CUDA code.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma
https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/

There is no other method to access them. Any method that uses them, such as CUBLAS, CUTLASS, etc., is using CUDA wmma operations under the hood (i.e. in some CUDA kernel in the library).
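For reference, here is a minimal sketch of what driving a tensor core from CUDA looks like. It is my own illustration, not the sample's kernel, and the kernel name and pointer arguments are illustrative only: one warp cooperatively computes a single 16x16x16 D = A*B + C using the nvcuda::wmma API.

#include <mma.h>
using namespace nvcuda;

// One warp cooperatively performs a single 16x16x16 multiply-accumulate on
// tensor cores; all 32 threads of the warp must reach these calls together.
// Requires a tensor-core-capable GPU (e.g. compile with -arch=sm_75 for Turing).
__global__ void wmma_16x16x16(const half *a, const half *b, const float *c, float *d)
{
    // Fragments are distributed across the registers of the whole warp
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, a, 16);                        // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::load_matrix_sync(acc_frag, c, 16, wmma::mem_row_major);

    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);           // the tensor core op

    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}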

There are no “tensor core warps”. The only warp is a CUDA warp.

Therefore, in order for tensor core units from separate SMs to be used, the underlying CUDA kernel would have to be using at least 2 SMs, which means it has at least 2 threadblocks.

CUDA wmma activity issuing from a given threadblock only has access to the tensor core units that are in the SM that threadblock is resident on.
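To illustrate that point with the sketch above (hypothetical launches, using assumed device pointers dA, dB, dC, dD that have already been allocated): the number of SMs whose tensor cores can participate is bounded by the number of threadblocks the kernel launches.

// A single block (here a single warp) can only ever use the tensor cores of
// the one SM it is resident on:
wmma_16x16x16<<<1, 32>>>(dA, dB, dC, dD);   // at most 1 SM's tensor cores

// Only a launch with 2 or more blocks gives the scheduler the option of using
// 2 or more SMs (and it may still place both blocks on the same SM):
// some_tiled_wmma_kernel<<<dim3(gridX, gridY), dim3(128, 4)>>>(...);   // hypothetical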

None of this answers your question directly, for the specific case. I don’t intend to try.

If you present some actual CUDA kernel code, the question is answerable.

If you are referring to access via a library routine, the answer will depend on the implementation of that library, which could change from one version to the next.

For the example which I am assuming you are referring to:

https://github.com/parallel-forall/code-samples/blob/master/posts/tensor-cores/simpleTensorCoreGEMM.cu

that example uses both straight CUDA coding (the wmma portion and the kernel) and also a library call (cublasGemmEx). The answer could be different for the two (although I doubt it). In the case of the kernel implementation, there is this comment in the code:

// 128x4 means we have 16 warps and a block computes a 64x64 output tile
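Spelling out the arithmetic behind that comment (illustrative variable names, not a quote from the sample):

int threadsX = 128, threadsY = 4;                 // blockDim = (128, 4)
int warpsPerBlock = (threadsX * threadsY) / 32;   // = 16 warps per block
// Each warp computes one 16x16 WMMA output tile: 128/32 = 4 warps across the
// x dimension and 4 warps down the y dimension, i.e. a 4x4 grid of 16x16 tiles.
int tileWidth  = (threadsX / 32) * 16;            // = 64
int tileHeight = threadsY * 16;                    // = 64 -> a 64x64 output tile per block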

Got it!
Thanks Robert

// 128x4 means we have 16 warps and a block computes a 64x64 output tile

I would rather see it as 16 x (16x16) = 64x64 (although the units are not the same on both sides of the equation), as it would have been more intuitive.

i.e. 16 warps (per block, on a 4x4 2-D warp sub-index) each acting on a 16x16 data sub-tile. One warp of 32 CUDA threads feeds the K running-index (size 16) data onto the tensor cores to compute one sub-tile. Then all sub-tiles are composed to get the 64x64 tile.

It all makes sense now.
It works great!

regards,

Joel Rodriguez