How much do memory and compute overlap in a GEMM?

Should I think of the overall latency as memory latency + compute latency, or as max(memory latency, compute latency)? The former would imply that memory access and compute mostly do not overlap, and the latter would imply that they mostly do.

Thanks!

Hi,

Are you referring to a specific latency that's being reported by some tool? Or latency in general?

If it's the latter, there is a general guide on how GPU performance is measured here: https://docs.nvidia.com/deeplearning/sdk/dl-performance-guide/index.html. Depending on the problem, overall performance can be limited by compute, by memory bandwidth, or by other factors.
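
To make the two models from the question concrete, here is a rough sketch in Python that computes both estimates for a GEMM of a given shape. It is only a back-of-the-envelope model: the function name `gemm_time_estimates` is made up for this post, the peak throughput and bandwidth defaults are placeholder values and not the specs of any particular GPU, and the byte count assumes each matrix moves to or from memory exactly once, which real kernels only approximate.

```python
def gemm_time_estimates(m, n, k, bytes_per_elem=2,
                        peak_flops=100e12, peak_bw=1e12):
    """Idealized time estimates (seconds) for C[m,n] = A[m,k] @ B[k,n].

    peak_flops (FLOP/s) and peak_bw (bytes/s) are hypothetical placeholders;
    substitute the numbers for your own hardware.
    """
    # One multiply and one add per (m, n, k) triple.
    flops = 2 * m * n * k
    # Assumption: A and B are each read once and C is written once.
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)

    t_compute = flops / peak_flops
    t_memory = bytes_moved / peak_bw

    return {
        "compute": t_compute,
        "memory": t_memory,
        # Model 1: memory traffic and math never overlap.
        "no_overlap": t_compute + t_memory,
        # Model 2: memory traffic and math overlap perfectly.
        "full_overlap": max(t_compute, t_memory),
        "bound": "compute" if t_compute >= t_memory else "memory",
    }


if __name__ == "__main__":
    # A large square GEMM and a single-row (GEMV-like) case.
    for shape in [(4096, 4096, 4096), (1, 4096, 4096)]:
        est = gemm_time_estimates(*shape)
        print(shape, est["bound"], est["no_overlap"], est["full_overlap"])
```

With these placeholder numbers the large square GEMM comes out compute-bound and the single-row case memory-bound. The two models can never differ by more than a factor of 2, and the gap is largest when the compute and memory times are similar. A measured latency should land somewhere between the two estimates only if the kernel actually reaches the assumed peaks, so treat both as lower-bound style estimates rather than predictions.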