I cannot speak to your specific case, as I have no access to, and consequently no experience with, A100 hardware.
GEMM is generally considered a typical example of a compute-bound task. However, some of the many GEMM variants may have become memory-bound by now. The first indications that we are heading in that direction appeared a few years back, when some GEMMs were found to consume 80% of the available memory bandwidth, and this phenomenon is not restricted to GPUs.
It seems to be an inevitable, and thus entirely foreseeable, development, as compute throughput has grown faster than memory throughput for decades.
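To make the compute-bound vs. memory-bound distinction concrete, here is a minimal roofline-style sketch. The peak figures used are the published A100 40GB numbers (FP16 Tensor Core throughput and HBM2 bandwidth), stated here as assumptions, and the byte count assumes each matrix crosses DRAM exactly once (ideal caching):

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_element=2):
    """FLOPs per byte of DRAM traffic for C = A @ B, assuming each matrix
    is read or written exactly once (ideal caching)."""
    flops = 2 * m * n * k                                   # one multiply + one add per MAC
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved

PEAK_FLOPS = 312e12   # assumed FP16 Tensor Core peak, A100 40GB (FLOP/s)
PEAK_BW = 1555e9      # assumed HBM2 bandwidth, A100 40GB (bytes/s)
machine_balance = PEAK_FLOPS / PEAK_BW   # ~200 FLOP/byte

for shape in [(4096, 4096, 4096), (4096, 4096, 64), (1, 4096, 4096)]:
    ai = gemm_arithmetic_intensity(*shape)
    bound = "compute-bound" if ai > machine_balance else "memory-bound"
    print(f"M,N,K={shape}: AI = {ai:.1f} FLOP/B -> {bound}")
```

A large square GEMM lands well above the machine balance and stays compute-bound, while skinny shapes (small K, or M = 1, which is effectively a GEMV) fall below it and become memory-bound, which is exactly the kind of variant referred to above.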
You are right! Here we can see that memory utilization is much higher than compute utilization. But does "compute" refer to the CUDA cores, the Tensor Cores, or both? (I would like it to be both, but the roofline below separates the two…)