There are several different data formats supported by the A100 mma instructions, such as m8n32k16 and m16n16k16. However, I find that there are only 4 tensor cores per SM on the A100. I wonder how many mma instructions like m8n32k16 it takes to make full use of an SM. What is the preferred number? For example, if I have 2 thread blocks on an SM and each block has 4 warps, I count that as 8 warps issuing mma instructions on the SM.
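For reference, here is a minimal sketch of what I mean by one mma operation per warp, written with the WMMA API for the 8x32x16 shape; the kernel name, pointers, and leading dimensions are placeholders I chose for illustration, not anything specific to my real kernel:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Launch with a single warp (32 threads) for this illustration.
// One mma_sync call is what I am counting as "one mma instruction" per warp.
__global__ void one_mma_m8n32k16(const half *A, const half *B, float *C) {
    // Fragments for the 8x32x16 shape (M=8, N=32, K=16), half in, float accumulate.
    wmma::fragment<wmma::matrix_a, 8, 32, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 8, 32, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 8, 32, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);   // leading dimension of row-major A is K=16
    wmma::load_matrix_sync(b_frag, B, 16);   // leading dimension of col-major B is K=16
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(C, c_frag, 32, wmma::mem_row_major);  // leading dimension N=32
}
```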
Normally the Tensor Cores execute one tensor core instruction at a time (plus pipelining). There are two pipelines for INT and FP Tensor Cores, but in my experience they cannot be used concurrently to increase instruction throughput. It could be that the A100 has a separate FP64 Tensor Core pipeline, but that would probably not increase instruction throughput either.
Using different shapes for the same data format will definitely not run in parallel.
Each of the 4 SM partitions has one Tensor Core. But run enough warps per SM partition to make sure it is kept fully occupied.
So how many warps are usually recommended per SM? Is 4 enough?
Usually between 8 and 32 is a good number. If you have a Tensor Core heavy warp, then 4 can already give quite good results, but it is better to have at least 8. If you access global device memory, then it depends on how much caching is possible.
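To make that concrete, here is a rough host-side sketch of a launch configuration that targets about 16 resident warps per SM, i.e. 4 per SM partition; the kernel name and the assumption of 2 resident blocks per SM are placeholders for illustration, and real residency depends on your kernel's register and shared-memory usage:

```cuda
#include <cuda_runtime.h>

void launch_example(/* kernel arguments omitted */) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int warpsPerBlock = 8;                        // 8 warps = 256 threads per block
    const int blocksPerSM   = 2;                        // aim for 16 warps resident per SM
    dim3 block(warpsPerBlock * 32);
    dim3 grid(prop.multiProcessorCount * blocksPerSM);  // e.g. 108 SMs x 2 blocks on an A100

    // my_mma_kernel<<<grid, block>>>(...);             // placeholder kernel launch
}
```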
Thanks!