The programming book says that four mma.m8n8k4 fp16 calculations are scheduled in a cycle. Does that mean I need at least 4 MMA operations per warp, otherwise I can't make full use of the tensor core?

m8n8k4 fp16 was the very first matrix multiplication instruction of the Volta generation (1st gen.).

It is still supported on newer architectures, but it runs more slowly there.

If you want to make full use of the tensor cores, avoid it.

Let us say you use the m16n8k8 fp16 instruction. That is 1024 FMAs per instruction.

1st/2nd gen: Volta and Turing can do 64 FP16 FMAs/cycle/Tensor Core with 2 Tensor Cores per SM Partition.

3rd/4th gen: Ampere and Ada can do 256 FP16 FMAs/cycle/Tensor Core with 1 Tensor Core per SM Partition (except Consumer Ampere, which can only do 128).

Hopper (4th gen.) can do 512 and Blackwell (5th gen., at least the datacenter GPU) can do 1024.

You have 4 SM Partitions per SM.
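To put the per-Tensor-Core numbers above on a per-SM basis, here is a small C++ sketch. It only multiplies the figures quoted in this thread (FMAs/cycle/Tensor Core, Tensor Cores per SM Partition, 4 partitions per SM); the struct and function names are made up for illustration, and the rates are the thread's numbers, not datasheet values.

```cpp
#include <cstdint>

// Per-generation figures as quoted in this thread (assumptions, not datasheet values).
struct TensorCoreConfig {
    std::uint32_t fmas_per_cycle_per_tc;  // FP16 FMAs per cycle per Tensor Core
    std::uint32_t tcs_per_partition;      // Tensor Cores per SM Partition
};

constexpr std::uint32_t kPartitionsPerSm = 4;

// Peak FP16 FMAs per cycle for one whole SM.
constexpr std::uint32_t fmas_per_cycle_per_sm(TensorCoreConfig c) {
    return c.fmas_per_cycle_per_tc * c.tcs_per_partition * kPartitionsPerSm;
}

constexpr TensorCoreConfig kVoltaTuring{64, 2};  // 64 FMAs/cycle/TC, 2 TCs per partition
constexpr TensorCoreConfig kA100{256, 1};        // 256 FMAs/cycle/TC, 1 TC per partition
```

With these inputs, Volta/Turing comes out at 512 FMAs/cycle/SM and the A100 at 1024 FMAs/cycle/SM, but only if every partition has a warp feeding its Tensor Core.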

So several warps (at least 4) per SM need to execute MMA instructions, and a lot of them.

Thanks! If I have four warps per threadblock, and each has one fp16 m16n8k8 operation, can I make full use of the tensor cores? If not, would 8 warps per threadblock do?

Four warps per SM probably can make full use of the tensor cores, especially if you do not have Hopper or Blackwell.

But also consider how the warps load data. If they have to wait a long time for global memory latencies, your tensor core pipeline can starve (= not be fully occupied).

The situation also gets better if more than one threadblock is resident on a multiprocessor.

You write “one fp16 m16n8k8 operation” - I hope you mean one at a time?

If it is one altogether, it is hardly worth launching a CUDA kernel for it.

The A100 (in your title) specifically can do 1 fp16 m16n8k8 MMA instruction per SM Partition every 4 cycles (1024 FMAs at 256 FMAs/cycle). So 4 MMA instructions per SM every 4 cycles. Each of your four warps should provide an MMA instruction every 4 cycles to make full use of the tensor cores.
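The 4-cycle figure follows from the numbers earlier in the thread. A minimal C++ sketch of that arithmetic, with hypothetical helper names and assuming the Tensor Core runs at the quoted peak rate:

```cpp
#include <cstdint>

// FMAs in one warp-wide MMA of shape m x n x k:
// each of the m*n outputs accumulates k products.
constexpr std::uint32_t fmas_per_mma(std::uint32_t m, std::uint32_t n, std::uint32_t k) {
    return m * n * k;
}

// Cycles one Tensor Core is occupied by one MMA at its peak rate (rounded up).
constexpr std::uint32_t cycles_per_mma(std::uint32_t fmas, std::uint32_t fmas_per_cycle) {
    return (fmas + fmas_per_cycle - 1) / fmas_per_cycle;
}
```

For m16n8k8, fmas_per_mma(16, 8, 8) is 1024, and cycles_per_mma(1024, 256) is 4, which matches the A100 figure. The same formula with the 512 FMAs/cycle quoted for Hopper would give 2 cycles.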

If your A100 runs at 1410 MHz and your kernel runs for 1 ms, that is 1,410,000 cycles, so you need more than 350,000 MMA instructions per warp.
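The same estimate as a small C++ helper, in case you want to plug in your own clock and kernel duration. The function name is made up, and it assumes one warp per SM Partition issuing back to back with no stalls:

```cpp
#include <cstdint>

// MMA instructions one warp must issue to keep its Tensor Core busy for the
// whole kernel. MHz * microseconds gives the number of elapsed cycles.
constexpr std::uint64_t mmas_per_warp(std::uint64_t clock_mhz,
                                      std::uint64_t duration_us,
                                      std::uint64_t cycles_per_mma) {
    return clock_mhz * duration_us / cycles_per_mma;
}
```

mmas_per_warp(1410, 1000, 4) evaluates to 352500, i.e. the "more than 350,000" above.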

My guess would be that if you attempt to run the 4x m8n8k4 op on an Ampere GPU, there would be no actual Tensor Core utilization at all. You can always check the generated SASS to be sure.

Thanks! I will try to have 8 warps per block

As the A100 has one tensor core per SM partition, will they operate together to calculate a single MMA operation? For example, if I have an mma m16n8k8 calculation on warp 0, will it be calculated by the 4 tensor cores on that SM together (then it could be done in one cycle)? Or will only the tensor core in that partition calculate it, costing 4 cycles?

A Tensor Core (TC) unit only works for the warps assigned to it, i.e. that are in the same SMSP.

In order to use 4 TC units, split across 4 SMSPs, it will require at least 4 warps, each of which is issuing TC ops.

A TC unit belonging to a particular SMSP does not work on warps that belong to other SMSPs.

Also be aware that the 4 cycles is a throughput figure (= the average time per instruction over a lot of calculations). The latency (= the time from a single data input to its result output) can be longer. That is because the computation units (including the tensor cores) are pipelined.
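A simple way to picture throughput versus latency is an idealized pipeline: the first result appears after the full latency, and each further independent MMA completes one issue interval later. The C++ sketch below uses that textbook model. The function name is made up, and the latency value in the example is a hypothetical placeholder, since no real latency number is given in this thread.

```cpp
#include <cstdint>

// Idealized pipeline model: n independent MMAs issued back to back.
// issue_interval: cycles between consecutive issues (4 on A100 for m16n8k8).
// latency: cycles from issue to result for a single MMA (hypothetical input).
constexpr std::uint64_t total_cycles(std::uint64_t n,
                                     std::uint64_t issue_interval,
                                     std::uint64_t latency) {
    return n == 0 ? 0 : latency + (n - 1) * issue_interval;
}
```

With a placeholder latency of 16 cycles, a single MMA takes 16 cycles, while 1000 back-to-back MMAs take 16 + 999 * 4 = 4012 cycles, so the average per instruction approaches the 4-cycle figure only when many MMAs are in flight.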

Thanks!