Is GEMM memory bound? (quite large matrices, but using Tensor Cores)

0321-fp16.zip (689.8 KB)
0321-fp32.zip (376.5 KB)

import torch

b = 2            # batch size
num_heads = 32   # attention heads
s = 16384        # sequence length
d = 128          # hidden dim per head

Q = torch.rand((b, num_heads, s, d), dtype=torch.bfloat16, device="cuda")
K = torch.rand((b, num_heads, s, d), dtype=torch.bfloat16, device="cuda")
# fp32 variant:
# Q = torch.rand((b, num_heads, s, d), device="cuda")
# K = torch.rand((b, num_heads, s, d), device="cuda")

# batched Q @ K^T: (b*num_heads, s, d) x (b*num_heads, d, s) -> (b*num_heads, s, s)
output = torch.bmm(Q.view(b * num_heads, s, d), K.view(b * num_heads, s, d).transpose(1, 2))

I tested this on my A100 GPU. If I use fp16, the result looks memory bound:

If I use fp32, I get:

Am I correct that this GEMM is memory bound? Maybe because the Tensor Cores are so fast that the memory cannot keep up with them?
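For reference, here is a back-of-the-envelope estimate of the arithmetic intensity of this bmm. It is only a rough sketch: it ignores caching, write-allocate traffic, and kernel-level blocking, and assumes Q and K are read once and the s x s output is written to HBM exactly once.

# Rough arithmetic-intensity estimate for the bmm above (sketch only).
b, num_heads, s, d = 2, 32, 16384, 128
batch = b * num_heads
bytes_per_elem = 2  # bfloat16 (use 4 for fp32)

flops = 2 * batch * s * s * d                                        # 2*M*N*K, with M = N = s, K = d
bytes_moved = (2 * batch * s * d + batch * s * s) * bytes_per_elem   # read Q and K, write Q @ K^T

print(f"{flops / 1e12:.2f} TFLOP")             # ~4.40 TFLOP
print(f"{bytes_moved / 1e9:.2f} GB moved")     # ~34.9 GB in bf16
print(f"{flops / bytes_moved:.0f} FLOP/byte")  # ~126

Because K (= d) is only 128, the huge s x s output dominates the traffic, so the intensity stays fairly low despite the large matrices.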

I cannot speak to your specific case, as I do not have access to, and consequently have no experience with, A100 hardware.

GEMM is generally considered a typical example of a compute-bound task. However, some of the many different variants of GEMM could have become memory bound by now. First indications that things were heading in that direction appeared a few years back, when some GEMMs were found to use 80% of the available memory bandwidth; this phenomenon is not restricted to GPUs.

It seems to be an inevitable development and thus entirely foreseeable, as compute throughput has grown faster than memory throughput for decades.
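To make that concrete for this shape, here is a minimal roofline ridge-point check. The figures are rounded A100 datasheet numbers, not measurements: roughly 312 TFLOPS dense BF16 Tensor Core peak and roughly 1.6-2.0 TB/s HBM bandwidth depending on the 40 GB / 80 GB variant.

# Minimal roofline ridge-point check with rounded A100 datasheet numbers (approximate).
peak_bf16_flops = 312e12   # dense BF16 Tensor Core peak, FLOP/s
hbm_bandwidth = 1.6e12     # bytes/s for the 40 GB SXM part (~2.0e12 for 80 GB)

ridge = peak_bf16_flops / hbm_bandwidth   # ~195 FLOP/byte
intensity = 126                           # from the bf16 estimate above, d = 128

print(f"ridge point ~{ridge:.0f} FLOP/byte")
print("memory bound" if intensity < ridge else "compute bound")

# The intensity lands below the BF16 Tensor Core ridge, i.e. memory bound.
# Against the FP32 CUDA-core peak (~19.5 TFLOPS) the ridge drops to about
# 12 FLOP/byte, so the same shape run in fp32 on CUDA cores can come out
# compute bound instead (unless TF32 Tensor Cores are used).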


You are right! Here we see that memory utilization is much higher than compute utilization. But does "compute" here mean CUDA cores, Tensor Cores, or both? (I wish it were both, but the roofline below separates these two…)

Thanks!!!