Operation time comparison

For the A100, I would like to know how many cycles the following operations cost:

  1. fp16 mma.m16n8k8
  2. __shfl_sync
  3. loading 4 B of data from shared memory without bank conflicts

For 1., see Fig. 7 of this paper.
For 3., see Table 10 of the same paper.

  1. Each of the 4 SM partitions can issue the mentioned MMA instruction once every 4 cycles, so the SM as a whole sustains on average 1 such instruction per cycle.
  2. and 3.: Each SM can do one __shfl_sync or one 4 B-per-thread shared-memory load per cycle. This is a resource shared between the 4 SM partitions (the MIO pipeline).
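Assuming every operand is re-fetched from shared memory for every MMA (no register reuse), those numbers give a quick back-of-envelope gap:

```latex
% Per-warp operand traffic of one m16n8k8 fp16 MMA:
%   A: 16 \times 8 \text{ halves} \times 2\,\mathrm{B} = 256\,\mathrm{B},
%   B: 8 \times 8 \text{ halves} \times 2\,\mathrm{B} = 128\,\mathrm{B}
\frac{256\,\mathrm{B} + 128\,\mathrm{B}}
     {\underbrace{32 \times 4\,\mathrm{B}}_{\text{one conflict-free shared load}}}
= 3 \text{ load cycles per MMA},
\qquad \text{vs. } 1 \text{ MMA issued per SM per cycle.}
```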

So it is not easy to load enough data to keep the Tensor Cores occupied.

So does that mean that if I need to do a lot of matrix multiplications A * B, where B is a permutation matrix, then even though each multiplication is of size 16 * 8 * 8 and could run on a Tensor Core, I should still load the data into registers and use shfl instead?
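A hypothetical sketch of the shuffle alternative (not from this thread; `permute_warp` and the `src_lane` table are illustrative names): a fixed permutation of 32 values within a warp can be applied directly with __shfl_sync rather than by multiplying with a permutation matrix B on the Tensor Cores.

```cuda
// Each lane pulls the value held by lane src_lane[lane]; for a true
// permutation, every source lane is consumed by exactly one lane.
__global__ void permute_warp(const float* in, const int* src_lane, float* out)
{
    int lane = threadIdx.x & 31;          // lane id within the warp
    float v  = in[threadIdx.x];           // one value per thread
    out[threadIdx.x] = __shfl_sync(0xffffffffu, v, src_lane[lane]);
}
```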

Tensor Core operations always work on registers.

It means that one such MMA instruction has an A operand with 128 fp16 values (256 bytes, 8 bytes per thread) and a B operand with 64 fp16 values (128 bytes, 4 bytes per thread).
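For reference, a minimal inline-PTX sketch of that instruction (fp16 inputs, fp32 accumulate, register packing as just described: two 32-bit registers of A halves and one of B halves per thread; the wrapper name is an assumption):

```cuda
// One m16n8k8 fp16 MMA with fp32 accumulation, per the PTX ISA.
// a0,a1 hold this thread's 4 halves of A (8 B); b0 holds 2 halves of B (4 B).
__device__ void mma_m16n8k8(float d[4], unsigned a0, unsigned a1,
                            unsigned b0, const float c[4])
{
    asm volatile(
        "mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a0), "r"(a1), "r"(b0),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```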

If you apply each A to just one fixed B (rather than to several), and if you stage A through shared memory, then merely reading the A matrices (or, alternatively, shuffling them around) is already slower than the MMA instructions themselves.

You can accept that, or try to reuse each A matrix for more than a single Tensor Core instruction. (Your algorithm probably does so in some way anyway; otherwise the bottleneck would not be shared memory but reading A from global memory, unless A is computed on the fly.)
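The reuse idea can be sketched like this (assumptions: `mma_m16n8k8` is a hypothetical device wrapper around the PTX mma instruction, and `NB` is a placeholder tile count):

```cuda
// Load the A fragment into registers once, then issue several Tensor Core
// ops against different B fragments, amortizing A's shared-memory traffic.
template <int NB>
__device__ void apply_A_to_many_B(float d[NB][4], unsigned a0, unsigned a1,
                                  const unsigned (&b)[NB], const float c[4])
{
    for (int j = 0; j < NB; ++j)
        // Same A registers reused across NB MMAs; only B changes per step.
        mma_m16n8k8(d[j], a0, a1, b[j], c);
}
```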