How much do memory and compute overlap in a GEMM?

Should I think of the overall latency as memory latency + compute latency, or as max(memory latency, compute latency)? The former would imply that memory and compute mostly do not overlap, and the latter would imply that they overlap almost completely.



Are you referring to a specific latency that’s being reported by some tool? Or latency in general?

If it’s the latter, there is a general guide on how GPU Performance is measured here: The overall perf can be bottlenecked by compute, by memory, or by something else, depending on the problem.
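
To make the two models from the question concrete, here is a rough roofline-style sketch. It is only an estimate under stated assumptions: each of A, B, and C crosses DRAM exactly once (ideal on-chip reuse), and `peak_flops` and `peak_bw` are hypothetical numbers you would replace with your own GPU's specs. The function name `gemm_latency_bounds` is made up for illustration.

```python
def gemm_latency_bounds(m, n, k, bytes_per_elem, peak_flops, peak_bw):
    """Estimate lower/upper latency bounds for C[m,n] = A[m,k] @ B[k,n].

    Assumptions (not measured): every matrix is moved to or from DRAM
    exactly once, and the kernel reaches the stated peak rates.
    """
    flops = 2 * m * n * k                       # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)

    t_compute = flops / peak_flops              # seconds if math were the only cost
    t_memory = bytes_moved / peak_bw            # seconds if DRAM traffic were the only cost

    return {
        "t_compute": t_compute,
        "t_memory": t_memory,
        "full_overlap": max(t_compute, t_memory),   # memory and compute fully overlapped
        "no_overlap": t_compute + t_memory,         # memory and compute fully serialized
        "bound": "compute" if t_compute >= t_memory else "memory",
    }


if __name__ == "__main__":
    # Hypothetical device: 100 TFLOP/s of FP16 math, 1 TB/s of DRAM bandwidth.
    for shape in [(4096, 4096, 4096), (1, 4096, 4096)]:
        est = gemm_latency_bounds(*shape, bytes_per_elem=2,
                                  peak_flops=100e12, peak_bw=1e12)
        print(shape, est["bound"], est["full_overlap"], est["no_overlap"])
```

A real kernel should land somewhere between `full_overlap` and `no_overlap`. Well-tuned GEMM kernels pipeline loads against math, so they tend to sit closer to the max() end, but how close depends on the problem size and the implementation, so a profiler is the only way to know for a given case.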