A100 data movement inside of the Memory

nokanaran · April 27, 2023, 7:58am

Hi,

I am having difficulty understanding the figure presented.

For V100 I see 4 lines for each warp, but A100 has 1 line for each warp to communicate with the RF. Based on that, I would expect V100 to be faster at communicating with the RF, but A100 is faster.

Why does V100 have 4x more warps?

Also, from SMEM V100 has 2x more connections, so how is A100 better?

Robert_Crovella · April 27, 2023, 6:43pm

Considering you linked the whitepaper, I will refer to the diagram on p40 of the whitepaper. It’s not identical to the picture you posted, which appears to be from a presentation of some sort.

Several kinds of efficiency improvements are being depicted in this diagram:

- A100 introduced hardware-accelerated asynchronous loading of shared memory from global memory. This is reflected in the A100 flow, which shows no trips through L1 or RF before the data enters SMEM.
- More efficient tensorcore patterns in A100. This is depicted by the difference between the diagrams: the V100 case shows 4 blue arrows from SMEM to RF, as well as 4 black arrows for each warp, whereas the A100 case shows 2 of each. The idea here is that a tensorcore op involves loading operands into a particular register pattern in the warp. The pattern of operands in memory determines how efficiently this can be accomplished. The A100 tensorcore load operations can be more efficient because of the ordering in memory of the operands needed by each warp, in a tiled matrix-multiply setting. Also, keep in mind that the highest-throughput tensorcore op in V100 was four 8x8 multiplies per warp/op, whereas A100 can efficiently do a single 16x16 multiply per warp/op.

As a result of these two changes, the number of operations to/from shared memory is reduced (improving the bandwidth situation), and the capacity requirements in the register file (“in flight”) are cut in half, because we no longer need a temporary copy of the operands as they make their way from global memory to shared memory.
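To make the register-file point concrete, here is a toy model in Python. It is only a sketch, not a hardware simulation: the stage lists are an assumption read off the description above (V100 stages operands through L1 and RF on the way into SMEM, A100 copies straight into SMEM), and the names `V100_PATH`, `A100_PATH`, `rf_visits`, and `hops` are hypothetical helpers made up for the illustration.

```python
# Toy model of the path one tile operand takes before a tensorcore op.
# ASSUMPTION: stage lists are simplified from the description in this
# thread; they are not measured or taken from any NVIDIA API.

V100_PATH = ["global", "L1", "RF", "SMEM", "RF"]  # staged through registers
A100_PATH = ["global", "SMEM", "RF"]              # async copy skips L1 and RF


def rf_visits(path):
    """Number of times the operand occupies the register file."""
    return path.count("RF")


def hops(path):
    """Number of transfers between consecutive stages."""
    return len(path) - 1


for name, path in (("V100", V100_PATH), ("A100", A100_PATH)):
    print(f"{name}: {rf_visits(path)} RF visit(s), {hops(path)} hop(s)")
```

In this model the operand lands in the register file twice on the V100-style path and once on the A100-style path, which is the "cut in half" effect mentioned above. It says nothing about actual latency or bandwidth numbers.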

The arrows don’t reflect communication links, paths or connectivity, but instead represent transactions required during a typical tiled matrix-multiply operation. One reason A100 is faster is that it uses the available capacity more efficiently for the same work.

V100 does not have or require 4x more warps. The warps on each side of the diagram are labelled in black: warp0, warp1, warp2, warp3. However, depositing the operands as they appear in memory into the thread/register-file pattern needed to support the tensorcore op requires more transactions on V100 than on A100.

Again, those are not connections. Those are operations or transactions that are required for a particular work sequence.
