Operation time comparison

For the A100, I would like to know how many cycles the following operations cost:

  1. fp16 mma.m16n8k8
  2. __shfl_sync
  3. loading 4 B of data from shared memory without bank conflicts

For 1., see Fig. 7 of this paper.
For 3., see Table 10 of the same paper.

  1. Each of the 4 SM partitions can issue the mentioned MMA instruction once every 4 cycles, so the SM as a whole sustains on average 1 such instruction per cycle.
  2. and 3.: Each SM can do one __shfl_sync or one 4 B-per-thread shared-memory load per cycle. This is a resource shared between the 4 SM partitions (the MIO pipeline).
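Assuming every operand is re-fetched from shared memory for every MMA (no register reuse), those numbers give a quick back-of-envelope gap:

```latex
% Per-warp operand traffic of one m16n8k8 fp16 MMA:
%   A: 16 \times 8 \text{ halves} \times 2\,\mathrm{B} = 256\,\mathrm{B},
%   B: 8 \times 8 \text{ halves} \times 2\,\mathrm{B} = 128\,\mathrm{B}
\frac{256\,\mathrm{B} + 128\,\mathrm{B}}
     {\underbrace{32 \times 4\,\mathrm{B}}_{\text{one conflict-free shared load}}}
= 3 \text{ load cycles per MMA},
\qquad \text{vs. } 1 \text{ MMA issued per SM per cycle.}
```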

So it is not easy to load enough data to keep the Tensor Cores occupied.

So does that mean that if I need to do a lot of matrix multiplications A * B, where B is a permutation matrix, then even though each multiplication is of size 16 * 8 * 8 and could run on a Tensor Core, I should still load the data into registers and use shfl instead?
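A hypothetical sketch of the shuffle alternative (not from this thread; `permute_warp` and the `src_lane` table are illustrative names): a fixed permutation of 32 values within a warp can be applied directly with __shfl_sync rather than by multiplying with a permutation matrix B on the Tensor Cores.

```cuda
// Each lane pulls the value held by lane src_lane[lane]; for a true
// permutation, every source lane is consumed by exactly one lane.
__global__ void permute_warp(const float* in, const int* src_lane, float* out)
{
    int lane = threadIdx.x & 31;          // lane id within the warp
    float v  = in[threadIdx.x];           // one value per thread
    out[threadIdx.x] = __shfl_sync(0xffffffffu, v, src_lane[lane]);
}
```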

Tensor Core operations always work on registers.

It means that one such MMA instruction has an A operand with 128 fp16 values (256 bytes, 8 bytes per thread) and a B operand with 64 fp16 values (128 bytes, 4 bytes per thread).
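For reference, a minimal inline-PTX sketch of that instruction (fp16 inputs, fp32 accumulate, register packing as just described: two 32-bit registers of A halves and one of B halves per thread; the wrapper name is an assumption):

```cuda
// One m16n8k8 fp16 MMA with fp32 accumulation, per the PTX ISA.
// a0,a1 hold this thread's 4 halves of A (8 B); b0 holds 2 halves of B (4 B).
__device__ void mma_m16n8k8(float d[4], unsigned a0, unsigned a1,
                            unsigned b0, const float c[4])
{
    asm volatile(
        "mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a0), "r"(a1), "r"(b0),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```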

If you apply each A to just one fixed B (rather than to several), and if you stage A through shared memory, then merely reading the A matrices (or, alternatively, shuffling them around) is already slower than the MMA instructions themselves.

You can accept that, or try to reuse each A matrix for more than a single Tensor Core instruction. (Your algorithm probably does so in some way anyway; otherwise the bottleneck would not be shared memory but reading A from global memory, unless A is computed on the fly.)
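The reuse idea can be sketched like this (assumptions: `mma_m16n8k8` is a hypothetical device wrapper around the PTX mma instruction, and `NB` is a placeholder tile count):

```cuda
// Load the A fragment into registers once, then issue several Tensor Core
// ops against different B fragments, amortizing A's shared-memory traffic.
template <int NB>
__device__ void apply_A_to_many_B(float d[NB][4], unsigned a0, unsigned a1,
                                  const unsigned (&b)[NB], const float c[4])
{
    for (int j = 0; j < NB; ++j)
        // Same A registers reused across NB MMAs; only B changes per step.
        mma_m16n8k8(d[j], a0, a1, b[j], c);
}
```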