A100 data movement inside the memory hierarchy


I am having difficulty understanding the figure presented.

For V100 I am seeing 4 lines for each warp, but A100 has 1 line for each warp to communicate with the RF. Based on that I would expect V100 to be faster for communication with the RF, but A100 is faster.

Why does V100 have 4x more warps?

Also, from SMEM, V100 has 2x more connections, so how is A100 better?

Considering you linked the whitepaper, I will refer to the diagram on p40 of the whitepaper. It’s not identical to the picture you posted, which appears to be from a presentation of some sort.

Several kinds of efficiency improvements are being depicted in this diagram:

  1. A100 introduced hardware-accelerated asynchronous loading of shared memory from global memory. This is reflected in the A100 flow, which shows no trips through L1 or the RF before the data enters SMEM.
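As a rough illustration of item 1, here is a minimal sketch of the async copy path using the cooperative groups `memcpy_async` API (the kernel, tile size, and buffer names are my own for illustration; on sm_80+ this can compile to `cp.async`, which is what lets the data flow global → SMEM without staging through the register file):

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void tile_load(const float *gmem, float *out) {
  __shared__ float tile[256];
  auto block = cg::this_thread_block();
  // Asynchronous copy: on A100 the data can land in shared memory
  // without an intermediate copy in the register file.
  cg::memcpy_async(block, tile, gmem + blockIdx.x * 256, sizeof(tile));
  cg::wait(block);  // wait for the copy to complete before using the tile
  // ... compute on tile ...
  out[blockIdx.x * 256 + block.thread_rank()] = tile[block.thread_rank()];
}
```

On pre-A100 hardware the same API still works, but the copy is emulated through registers, so the "in flight" register-file savings shown in the diagram do not apply.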

  2. More efficient tensorcore loading patterns in A100. This is depicted by the 4 blue arrows from SMEM to RF, as well as the 4 black arrows for each warp, in the V100 case, whereas the A100 case shows 2 of each. The idea here is that a tensorcore op involves loading operands into a particular register pattern across the warp. The pattern of operands in memory determines how efficiently this can be accomplished. The A100 tensorcore load operations can be more efficient, due to the ordering of the operands in memory that are needed by each warp in a tiled matrix-multiply setting. Also, keep in mind that the highest-throughput tensorcore op in V100 was four 8x8 multiplies per warp/op, whereas A100 can efficiently do a single 16x16 multiply per warp/op.
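To make item 2 concrete, here is a generic warp-level sketch using the `nvcuda::wmma` API, where one warp computes a 16x16 tile (this is the portable WMMA interface, not the exact instruction sequence the diagram depicts; the pointer arguments and leading dimensions are assumed for illustration):

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16 tile: C = A * B (FP16 inputs, FP32 accumulate).
__global__ void warp_mma(const half *A, const half *B, float *C) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;
  wmma::fill_fragment(c, 0.0f);
  // Each load_matrix_sync scatters the tile's operands across the warp's
  // registers in the pattern the tensor core expects; the cost of these
  // loads is what the SMEM->RF arrows in the diagram represent.
  wmma::load_matrix_sync(a, A, 16);
  wmma::load_matrix_sync(b, B, 16);
  wmma::mma_sync(c, a, b, c);
  wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}
```

The same source compiles on both architectures; the difference the diagram shows is in how many shared-memory transactions the hardware needs to populate those fragments.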

As a result of these two changes, the number of operations to/from shared memory is reduced (improving the bandwidth situation), and the "in flight" capacity requirements in the register file are cut in half, because we no longer need a temporary copy of the operands as they make their way from global memory to shared memory.

The arrows don’t reflect communication links, paths or connectivity, but instead represent transactions required during a typical tiled matrix-multiply operation. One reason A100 is faster is because it uses available capacity more efficiently for the same work.

V100 does not have or require 4x more warps. The warps on each side of the diagram are labelled in black: warp0, warp1, warp2, warp3. However, depositing the operands as they appear in memory into the thread/register-file pattern needed to support the tensorcore op requires more transactions on V100 than it does on A100.

Again, those are not connections. Those are operations or transactions that are required for a particular work sequence.