Is GEMM memory bound? (quite large matrices, but using Tensor Cores)

0321-fp16.zip (689.8 KB)
0321-fp32.zip (376.5 KB)

import torch

b = 2            # batch size
num_heads = 32   # attention heads
s = 16384        # sequence length
d = 128          # hidden dim per head

Q = torch.rand((b, num_heads, s, d), dtype=torch.bfloat16, device="cuda")
K = torch.rand((b, num_heads, s, d), dtype=torch.bfloat16, device="cuda")
# fp32 variant:
# Q = torch.rand((b, num_heads, s, d), device="cuda")
# K = torch.rand((b, num_heads, s, d), device="cuda")

# batched Q @ K^T: (b*num_heads, s, d) x (b*num_heads, d, s) -> (b*num_heads, s, s)
output = torch.bmm(Q.view(b * num_heads, s, d), K.view(b * num_heads, s, d).transpose(1, 2))

I tested this on my A100 GPU. If I use fp16, the result looks memory bound:

If I use fp32, I get:

Am I correct that this GEMM is memory bound? Maybe because the Tensor Cores are so fast that the memory cannot keep up with them?
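For reference, here is a back-of-the-envelope estimate of the arithmetic intensity of this bmm. It is only a rough sketch: it ignores caching, write-allocate traffic, and kernel-level blocking, and assumes Q and K are read once and the s x s output is written to HBM exactly once.

# Rough arithmetic-intensity estimate for the bmm above (sketch only).
b, num_heads, s, d = 2, 32, 16384, 128
batch = b * num_heads
bytes_per_elem = 2  # bfloat16 (use 4 for fp32)

flops = 2 * batch * s * s * d                                        # 2*M*N*K, with M = N = s, K = d
bytes_moved = (2 * batch * s * d + batch * s * s) * bytes_per_elem   # read Q and K, write Q @ K^T

print(f"{flops / 1e12:.2f} TFLOP")             # ~4.40 TFLOP
print(f"{bytes_moved / 1e9:.2f} GB moved")     # ~34.9 GB in bf16
print(f"{flops / bytes_moved:.0f} FLOP/byte")  # ~126

Because K (= d) is only 128, the huge s x s output dominates the traffic, so the intensity stays fairly low despite the large matrices.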

I cannot speak to your specific case, as I do not have access to, and consequently have no experience with, A100 hardware.

GEMM is generally considered a typical example of a compute-bound task. However, some of the many different variants of GEMM could have become memory bound by now. First indications that things were heading in that direction appeared a few years back, when some GEMMs were found to use 80% of the available memory bandwidth; this phenomenon is not restricted to GPUs.

It seems to be an inevitable development and thus entirely foreseeable, as compute throughput has grown faster than memory throughput for decades.
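To make that concrete for this shape, here is a minimal roofline ridge-point check. The figures are rounded A100 datasheet numbers, not measurements: roughly 312 TFLOPS dense BF16 Tensor Core peak and roughly 1.6-2.0 TB/s HBM bandwidth depending on the 40 GB / 80 GB variant.

# Minimal roofline ridge-point check with rounded A100 datasheet numbers (approximate).
peak_bf16_flops = 312e12   # dense BF16 Tensor Core peak, FLOP/s
hbm_bandwidth = 1.6e12     # bytes/s for the 40 GB SXM part (~2.0e12 for 80 GB)

ridge = peak_bf16_flops / hbm_bandwidth   # ~195 FLOP/byte
intensity = 126                           # from the bf16 estimate above, d = 128

print(f"ridge point ~{ridge:.0f} FLOP/byte")
print("memory bound" if intensity < ridge else "compute bound")

# The intensity lands below the BF16 Tensor Core ridge, i.e. memory bound.
# Against the FP32 CUDA-core peak (~19.5 TFLOPS) the ridge drops to about
# 12 FLOP/byte, so the same shape run in fp32 on CUDA cores can come out
# compute bound instead (unless TF32 Tensor Cores are used).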


You are right! Here we see that memory utilization is much higher than compute utilization. But does "compute" here mean CUDA cores, Tensor Cores, or both? (I wish it were both, but the roofline below separates these two…)

Thanks!!!