Mma m8n8k4 on A100

half-0 · October 28, 2024, 3:51pm

The programming book says that four mma.m8n8k4 fp16 calculations are scheduled per cycle. Does that mean I need at least 4 mma operations per warp, otherwise I can't make full use of the tensor core?

Curefab · October 28, 2024, 4:07pm

m8n8k4 fp16 was the very first matrix multiplication instruction of the Volta generation (1st gen.).
It is still supported, but in a slower fashion for newer architectures.
If you want to make full use of the tensor cores, avoid it.

Let us say you use the m16n8k8 fp16 instruction. That is 16 × 8 × 8 = 1024 FMAs per instruction.

1st/2nd gen: Volta and Turing can do 64 FP16 FMAs/cycle/Tensor Core with 2 Tensor Cores per SM Partition.
3rd/4th gen: Ampere and Ada can do 256 FP16 FMAs/cycle/Tensor Core with 1 Tensor Core per SM Partition (except Consumer Ampere, which can only do 128).
Hopper (4th gen.) can do 512 and Blackwell (5th gen., at least the datacenter GPU) 1024.

You have 4 SM Partitions per SM.

So several warps (at least 4) per SM have to execute MMA instructions, and a lot of them.
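
As a rough sketch of the arithmetic above (not a measurement), the ideal issue interval of one MMA instruction on one SM Partition is its FMA count divided by the FMAs per cycle the partition's Tensor Cores can do. The function names here are made up for illustration, and the rates are the ones quoted in this post.

```python
def fmas_per_mma(m: int, n: int, k: int) -> int:
    """Number of FMAs in one warp-level m x n x k MMA instruction."""
    return m * n * k


def cycles_per_mma(m, n, k, fmas_per_cycle_per_tc, tcs_per_partition=1):
    """Idealized cycles one SM Partition needs per MMA instruction.

    Assumes the Tensor Core pipeline is kept perfectly fed (no stalls).
    """
    return fmas_per_mma(m, n, k) / (fmas_per_cycle_per_tc * tcs_per_partition)


# Rates as quoted above: 256 FMAs/cycle/Tensor Core for A100-class Ampere,
# 128 for consumer Ampere, 1 Tensor Core per SM Partition in both cases.
a100 = cycles_per_mma(16, 8, 8, fmas_per_cycle_per_tc=256)
consumer_ampere = cycles_per_mma(16, 8, 8, fmas_per_cycle_per_tc=128)

print(fmas_per_mma(16, 8, 8))  # 1024
print(a100)                    # 4.0
print(consumer_ampere)         # 8.0
```

With 4 SM Partitions per SM, that ideal rate only adds up if each partition has at least one warp issuing MMA instructions back to back.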

half-0 · October 28, 2024, 4:14pm

Thanks! If I have four warps per threadblock, and each has one fp16 m16n8k8 operation, can I make full use of the tensor cores? If not, would 8 warps per threadblock do?

Curefab · October 28, 2024, 4:17pm

Four warps per SM can probably make full use of the tensor cores, especially if you do not have Hopper or Blackwell.
But also consider how the warps load data. If they have to wait a long time on global memory latency, your tensor core pipeline can starve (= not be fully occupied).

Also, the situation gets even better if more than one threadblock is loaded onto a multiprocessor.

You write “one fp16 m16n8k8 operation” - I hope you mean one at a time?
If it is one altogether, it is hardly worth launching a CUDA kernel for it.

The A100 (in your title) specifically can do 1 fp16 m16n8k8 MMA instruction per SM Partition every 4 cycles. So 4 MMA instructions per SM every 4 cycles. Each of your four warps should provide an MMA instruction every 4 cycles to make full use of the tensor cores.

If your A100 runs at 1410 MHz and your kernel runs for 1 ms, then you need more than 350,000 MMA instructions per warp (1,410,000 cycles / 4 cycles per instruction = 352,500).
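
A quick sketch of that estimate, assuming one warp per SM Partition and the ideal rate of one m16n8k8 instruction every 4 cycles (variable names are just for illustration):

```python
# Figures from the post: 1410 MHz clock, 1 ms kernel, 4 cycles per MMA.
clock_mhz = 1410            # cycles per microsecond
kernel_us = 1000            # 1 ms expressed in microseconds
cycles_per_mma = 4          # ideal issue interval per SM Partition on A100

total_cycles = clock_mhz * kernel_us
mma_per_warp = total_cycles // cycles_per_mma

print(total_cycles)   # 1410000
print(mma_per_warp)   # 352500
```

Integer units (MHz times microseconds) are used so the cycle count comes out exact rather than as a rounded float.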

Robert_Crovella · October 28, 2024, 6:02pm

My guess would be that if you attempt to run the 4x m8n8k4 op on an Ampere GPU, there would be no actual Tensor Core utilization at all. You can always check the generated SASS to be sure.

half-0 · October 29, 2024, 8:00am

Thanks! I will try to have 8 warps per block

half-0 · October 29, 2024, 8:06am

As the A100 has one tensor core per SM partition, will they operate together to calculate a single mma operation? For example, if I have an mma m16n8k8 calculation on warp 0, will it be calculated by the 4 tensor cores on that SM together (then it could be done in one cycle)? Or will only the tensor core in that partition calculate it, costing 4 cycles?

Robert_Crovella · October 29, 2024, 1:12pm

A Tensor Core (TC) unit only works for the warps assigned to it, i.e. that are in the same SMSP.

In order to use 4 TC units, split across 4 SMSPs, it will require at least 4 warps, each of which is issuing TC ops.

A TC unit belonging to a particular SMSP does not work on warps that belong to other SMSPs.

Curefab · October 29, 2024, 4:28pm

Also be aware that the 4 cycles is a throughput figure (= the average time per instruction over a lot of calculations). The latency (= the time from a single data input to its result output) can be longer. That is because the computation units (including the tensor cores) are pipelined.
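
A toy model of that difference, as a sketch only: in an ideal pipeline, N back-to-back instructions take roughly the latency of the first one plus one issue interval for each further one. The 32-cycle latency below is a made-up placeholder, not a measured A100 value; only the 4-cycle issue interval comes from this thread.

```python
def pipelined_cycles(n_instructions: int, latency: int, issue_interval: int) -> int:
    """Cycles until the last result is out, for an ideal pipeline with no stalls."""
    if n_instructions <= 0:
        return 0
    return latency + (n_instructions - 1) * issue_interval


HYPOTHETICAL_LATENCY = 32   # assumption for illustration, NOT a measured value
ISSUE_INTERVAL = 4          # cycles per m16n8k8 per SM Partition, from the thread

for n in (1, 10, 1000):
    total = pipelined_cycles(n, HYPOTHETICAL_LATENCY, ISSUE_INTERVAL)
    # Average cost per instruction approaches the issue interval as n grows.
    print(n, total, total / n)
```

A single instruction pays the whole latency, while a long stream of independent instructions averages out close to 4 cycles each, which is why a lot of MMA instructions per warp are needed.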

half-0 · October 31, 2024, 1:48pm

Thanks!

system · November 14, 2024, 1:48pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
