For A100, I would like to know how many cycles will cost for the following operations:
- fp16 mma.m16n8k8
- __shfl_sync
- load 4B data from shared memory without bank conflict
For 1., see Fig. 7 of this paper.
For 3., see Table 10 of the same paper.
So it is not easy to load enough data to keep the Tensor Cores occupied.
So does that mean that if I need to do a lot of matrix multiplications A * B, where B is a permutation matrix, then even though each multiplication has size 16 * 8 * 8 and can be executed on a Tensor Core, I should still load the data into registers and use __shfl_sync instead?
Tensor Core operations always work on registers.
It means that one such MMA instruction has an A operand with 128 fp16 values (256 bytes, 8 bytes per thread) and a B operand with 64 fp16 values (128 bytes, 4 bytes per thread).
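The operand footprint can be checked with a little arithmetic. A sketch; the per-thread figures simply divide the warp-wide fragment size by the 32 threads of a warp:

```python
# Operand sizes for one fp16 mma.m16n8k8 instruction (D = A * B + C).
M, N, K = 16, 8, 8
BYTES_PER_FP16 = 2
WARP_SIZE = 32  # the fragments are distributed across one warp

a_values = M * K                     # 16 x 8  = 128 fp16 values
b_values = K * N                     # 8 x 8   = 64 fp16 values
a_bytes = a_values * BYTES_PER_FP16  # 256 bytes warp-wide
b_bytes = b_values * BYTES_PER_FP16  # 128 bytes warp-wide

print(a_values, a_bytes, a_bytes // WARP_SIZE)  # 128 256 8
print(b_values, b_bytes, b_bytes // WARP_SIZE)  # 64 128 4
```

So each thread holds 8 bytes (four fp16 values) of A and 4 bytes (two fp16 values) of B for a single MMA.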
If you apply each A to just one fixed B rather than to several, and you transfer A through shared memory, then merely reading the A matrices (or, alternatively, shuffling them around) takes longer than the actual MMA instructions.
You can either accept that or try to reuse each A matrix for more than a single Tensor Core instruction. (Your algorithm will probably do so in some way anyway; otherwise the bottleneck is not shared memory but reading from global memory, unless A is computed on the fly.)
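The reuse argument can be sketched by counting instructions. This is only a rough model under stated assumptions: each thread fetches its fragment with 4-byte shared-memory loads (two for A, one for B), and latency hiding and bank conflicts are ignored:

```python
# Sketch: shared-memory load instructions issued per MMA, per thread,
# for fp16 m16n8k8. Assumes 4B loads; ignores latencies and bank conflicts.
A_LOADS = 2  # 8 bytes/thread of A -> two 4B loads
B_LOADS = 1  # 4 bytes/thread of B -> one 4B load

def loads_per_mma(a_reuse, b_reuse=1):
    """Load instructions per MMA when each A fragment feeds `a_reuse`
    MMA instructions and each B fragment feeds `b_reuse` of them."""
    return A_LOADS / a_reuse + B_LOADS / b_reuse

# No reuse at all: 3 load instructions compete with every 1 MMA.
print(loads_per_mma(1))           # 3.0
# Fixed permutation matrix B kept in registers, A still streamed:
print(loads_per_mma(1, 10**6))    # ~2.0 -> still load-bound
# Additionally reusing each A fragment across 8 MMAs:
print(loads_per_mma(8, 10**6))    # ~0.25
```

The point of the sketch: keeping the fixed B in registers helps, but as long as every A is read once and used once, the load instructions still outnumber the MMA instructions; only reusing A changes that ratio.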