There are several different data formats supported by the A100 mma instructions, such as m8n32k16 and m16n16k16. However, I find that there are only 4 tensor cores per SM on the A100. I wonder how many mma instructions like m8n32k16 it takes to make full use of an SM. What is the preferred number? For example, if I have 2 thread blocks on an SM and each block has 4 warps, I count that as 8 warps issuing mma instructions on the SM.
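For reference, here is a minimal sketch of what I mean by one mma operation per warp, written with the WMMA API for the 8x32x16 shape; the kernel name, pointers, and leading dimensions are placeholders I chose for illustration, not anything specific to my real kernel:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Launch with a single warp (32 threads) for this illustration.
// One mma_sync call is what I am counting as "one mma instruction" per warp.
__global__ void one_mma_m8n32k16(const half *A, const half *B, float *C) {
    // Fragments for the 8x32x16 shape (M=8, N=32, K=16), half in, float accumulate.
    wmma::fragment<wmma::matrix_a, 8, 32, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 8, 32, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 8, 32, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);   // leading dimension of row-major A is K=16
    wmma::load_matrix_sync(b_frag, B, 16);   // leading dimension of col-major B is K=16
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(C, c_frag, 32, wmma::mem_row_major);  // leading dimension N=32
}
```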
Normally the Tensor Cores execute one tensor core instruction at a time (plus pipelining). There are two pipelines for INT and FP Tensor Cores, but in my experience they cannot be used concurrently to increase instruction throughput. It could be that the A100 has a separate FP64 Tensor Core pipeline, but that would probably not increase instruction throughput either.
Using different shapes for the same data format will definitely not run in parallel.
Each of the 4 SM partitions has one Tensor Core. But run enough warps per SM partition to make sure it is kept fully occupied.
So how many warps are usually recommended per SM? Is 4 enough?
Usually between 8 and 32 is a good number. If you have a Tensor Core heavy warp, then 4 can already give quite good results, but it is better to have at least 8. If you access global device memory, then it depends on how much caching is possible.
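To make that concrete, here is a rough host-side sketch of a launch configuration that targets about 16 resident warps per SM, i.e. 4 per SM partition; the kernel name and the assumption of 2 resident blocks per SM are placeholders for illustration, and real residency depends on your kernel's register and shared-memory usage:

```cuda
#include <cuda_runtime.h>

void launch_example(/* kernel arguments omitted */) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int warpsPerBlock = 8;                        // 8 warps = 256 threads per block
    const int blocksPerSM   = 2;                        // aim for 16 warps resident per SM
    dim3 block(warpsPerBlock * 32);
    dim3 grid(prop.multiProcessorCount * blocksPerSM);  // e.g. 108 SMs x 2 blocks on an A100

    // my_mma_kernel<<<grid, block>>>(...);             // placeholder kernel launch
}
```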
Thanks!