Turing 16x16 MMA, SM usage, 1 or 2?

joelrod · December 6, 2018, 9:11am

I'm going through the simpleTensorCoreGEMM.cu example and wonder how many SMs are used to perform the 16x16 matrix multiply-accumulate. I have an RTX 2070.

I understand that each tensor core performs a 4x4 MMA, and that the RTX 2070 has 8 of these tensor cores per SM.

Question 1: I have two theories (below) and wonder which one is correct:

Theory 1: One SM does all 16x16=256 products, with 4 iterations of CUDA threads to accommodate the 2x8 tensor cores that a 16x16 would require. I think this is unlikely, as it would take 2 steps of tensor computing.

Theory 2: Two SMs, with 2 iterations of CUDA threads per SM to accommodate the 8 tensor cores, each SM computing 128 of the 16x16=256 products required. This seems the most likely scenario (2 CUDA core warp cycles), as it would take 1 step of tensor computing.

So… 2 SMs, with 16 tensor cores, for the 16x16 half-precision floating-point MMA operation? 2 “CUDA core warps” per SM but only one “tensor core warp”.

Question 2: Is there such a thing as a “CUDA cores warp” and a “tensor cores warp”?

Thanks, excellent product by the way.

Robert_Crovella · December 7, 2018, 4:11pm

At the lowest level, tensor core activity is driven from CUDA code.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma
https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/

There is no other method to access them. Any method that uses them, such as cuBLAS, CUTLASS, etc., is using CUDA wmma operations under the hood (i.e. in some CUDA kernel, in the library).

There are no “tensor core warps”. The only warp is a CUDA warp.

Therefore, in order for tensor core units from separate SMs to be used, the underlying CUDA kernel would have to be using at least 2 SMs, which means it has at least 2 threadblocks.

CUDA wmma activity issuing from a given threadblock only has access to the tensor core units that are in the SM that threadblock is resident on.
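
Since a wmma operation is issued collectively by one warp, and a warp lives in one threadblock, a single warp-level MMA stays on one SM. For a sense of scale, here is a rough host-side C++ count of the arithmetic in one such operation. It assumes the 16x16x16 shape (M = N = K = 16) used in the sample and a 4x4x4 multiply-accumulate as the tensor core building block; the function names are made up for illustration, and the count says nothing about how the hardware actually schedules the work.

```cpp
#include <cstdio>

// Multiply-accumulates in one M x N x K matrix multiply-accumulate (D = A*B + C).
long long fma_count(int m, int n, int k) {
    return 1LL * m * n * k;
}

// Number of t x t x t sub-operations needed to cover an M x N x K operation.
// Assumes (for illustration) that every dimension is a multiple of t.
long long sub_ops(int m, int n, int k, int t) {
    return 1LL * (m / t) * (n / t) * (k / t);
}

// Hypothetical helper: prints the counts for the warp-level 16x16x16 shape.
void print_counts() {
    std::printf("FMAs in 16x16x16: %lld\n", fma_count(16, 16, 16));
    std::printf("FMAs in 4x4x4:    %lld\n", fma_count(4, 4, 4));
    std::printf("4x4x4 sub-ops:    %lld\n", sub_ops(16, 16, 16, 4));
}
```

`fma_count(16, 16, 16)` is 4096 and `sub_ops(16, 16, 16, 4)` is 64, so the operation is considerably more than 256 products once the K dimension is included.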

None of this answers your question directly, for the specific case. I don’t intend to try.

If you present some actual CUDA kernel code, the question is answerable.

If you are referring to access via a library routine, the answer will depend on the implementation of that library, which could change from one version to the next.

For the example which I am assuming you are referring to:

https://github.com/parallel-forall/code-samples/blob/master/posts/tensor-cores/simpleTensorCoreGEMM.cu

That example uses both straight CUDA coding (the wmma portion, and the kernel) and also a library call (cublasGemmEx). The answer could be different for each (although I doubt it). In the case of the kernel implementation, there is this comment in the code:

// 128x4 means we have 16 warps and a block computes a 64x64 output tile
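
The arithmetic behind that comment can be checked with a small host-side C++ sketch. It assumes 32 threads per warp, one 16x16 output tile per warp, warps laid out along x in groups of 32 threads and one warp per y index; the helper names are hypothetical and not taken from the sample.

```cpp
#include <cstdio>

constexpr int kWarpSize = 32;  // threads per warp
constexpr int kWmmaM = 16;     // rows of one warp-level output tile (assumed)
constexpr int kWmmaN = 16;     // columns of one warp-level output tile (assumed)

// Warps in a block of block_x * block_y threads (block_x a multiple of 32).
int warps_per_block(int block_x, int block_y) {
    return (block_x / kWarpSize) * block_y;
}

// Output rows covered by one block: one 16-row tile per warp along x.
int block_tile_rows(int block_x) { return kWmmaM * (block_x / kWarpSize); }

// Output columns covered by one block: one 16-column tile per y index.
int block_tile_cols(int block_y) { return kWmmaN * block_y; }

// Blocks needed to cover `extent` elements with tiles of size `tile` (ceiling division).
int blocks_needed(int extent, int tile) { return (extent + tile - 1) / tile; }

// Hypothetical helper: prints the launch shape for an m x n output matrix.
void print_launch(int m, int n, int block_x, int block_y) {
    int gx = blocks_needed(m, block_tile_rows(block_x));
    int gy = blocks_needed(n, block_tile_cols(block_y));
    std::printf("%dx%d output: %d warps/block, %dx%d tile/block, grid %dx%d\n",
                m, n, warps_per_block(block_x, block_y),
                block_tile_rows(block_x), block_tile_cols(block_y), gx, gy);
}
```

With a 128x4 block this gives 16 warps and a 64x64 tile per block. Under these assumptions a lone 16x16 output needs only a 1x1 grid, i.e. one threadblock, and a threadblock is resident on a single SM.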

joelrod · December 8, 2018, 7:31am

Got it!
Thanks Robert

// 128x4 means we have 16 warps and a block computes a 64x64 output tile

I'd rather see it as 16 x (16x16) = 64x64 (although the units are not the same on both sides of the equation), as it would have been more intuitive.

i.e. 16 warps (per block, on a 4x4 2-D warp sub-index), each acting on a 16x16 data sub-tile. One warp of 32 CUDA-core threads feeds the data along the running K index (size 16) to the tensor cores to compute one sub-tile. All sub-tiles are then composed to get the 64x64 tile.
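
In code terms, that mapping might look like the host-side C++ sketch below. The index formulas follow my reading of the sample kernel (warp row index from x, warp column index from y) and should be treated as an assumption; the function and struct names are hypothetical.

```cpp
#include <cstdio>

constexpr int kWarpSize = 32;  // threads per warp
constexpr int kTile = 16;      // assumed WMMA_M = WMMA_N = WMMA_K = 16

struct TileOrigin { int row; int col; };

// Global warp index along M: every 32 consecutive x-threads form one warp.
int warp_m(int block_idx_x, int block_dim_x, int thread_idx_x) {
    return (block_idx_x * block_dim_x + thread_idx_x) / kWarpSize;
}

// Global warp index along N: each y index is its own warp.
int warp_n(int block_idx_y, int block_dim_y, int thread_idx_y) {
    return block_idx_y * block_dim_y + thread_idx_y;
}

// Top-left element of the 16x16 output sub-tile owned by warp (wm, wn).
TileOrigin tile_origin(int wm, int wn) {
    return {wm * kTile, wn * kTile};
}

// Number of 16-wide steps a warp takes along K (assumes K is a multiple of 16).
int k_steps(int k) { return k / kTile; }

// Hypothetical helper: prints the sub-tile owned by the warp containing a thread.
void print_tile(int bx, int by, int tx, int ty) {
    TileOrigin t = tile_origin(warp_m(bx, 128, tx), warp_n(by, 4, ty));
    std::printf("block (%d,%d) thread (%d,%d) -> sub-tile at (%d,%d)\n",
                bx, by, tx, ty, t.row, t.col);
}
```

All 32 threads of a warp land on the same sub-tile, and the 4x4 grid of warps in block (0,0) covers rows and columns 0 to 63.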

It all makes sense now.
It works great!

regards,

Joel Rodriguez
