How to determine whether a GEMM is bound on L1 or L2?

How can I determine whether a GEMM is bound on L1 or L2? Maybe using ncu?

ncu is probably the way to go. You could use individual metrics, of course, but basic ncu reporting includes a memory hotspot chart in the Memory Workload Analysis report section. Using the GUI, you can quickly verify visually which memory paths are at the highest level of utilization. Likewise, the SOL (Speed Of Light) report section contains a memory breakdown, which shows a Pareto list of memory path utilizations.

A basic intro-to-ncu blog is here.
A more detailed profiling blog, including discussion of the memory hotspot chart as well as a profile of a cublas GEMM kernel, is here. (The link is to part 1; the blog is in 3 parts. Part 1 shows a picture of the SOL section memory breakdown, part 2 has an example of the memory hotspot chart from the ncu Memory Workload Analysis report section, and part 3 profiles a cublas GEMM kernel.)
A video tutorial of the SOL section mentioning the memory breakdown Pareto list is here. The memory throughput breakdown is shown in the column on the right at about the 10-minute mark.
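
If you prefer the command line, here is a minimal sketch of how you could collect just those sections. The application name and the kernel name regex are placeholders, and the kernel filter syntax and section identifiers vary slightly between ncu versions; check ncu --list-sections on yours.

```
# Collect only the Speed Of Light and Memory Workload Analysis sections
# for kernels matching the regex, and write a report file for the ncu GUI.
ncu --section SpeedOfLight \
    --section MemoryWorkloadAnalysis \
    --section MemoryWorkloadAnalysis_Chart \
    -k regex:gemm \
    -o gemm_profile \
    ./my_app
```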

I have a question: does “highest level of utilization” mean that this cache level is the bottleneck?

For example, if the L1 cache’s peak throughput were 999999999999 TB/s, its reported utilization would be low. But that alone would not tell me that L1 is less of a bottleneck than L2.

Actually, what I am trying to figure out is which cache level is more of a bottleneck. I think I need to compare L1 vs. compute and L2 vs. compute, and then compare those two metrics, but I am not sure what metric I should use here…

Thanks!!!

The data I mentioned (the hotspot chart and the memory breakdown) are reported as percentages of peak, so it’s easy to see which path is closer to its peak.

You can hover your mouse over the memory breakdown chart to get metric info for the reported paths (i.e. which metrics are used to compute it.)
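
If you would rather look at the underlying numbers on the command line, the breakdown is built from per-path percent-of-peak throughput metrics. A sketch, assuming recent ncu metric names (verify with ncu --query-metrics; the application name and kernel regex are placeholders):

```
# Percent-of-peak throughput for the SM (compute), L1/TEX, L2 and DRAM paths
ncu -k regex:gemm --metrics \
sm__throughput.avg.pct_of_peak_sustained_elapsed,\
l1tex__throughput.avg.pct_of_peak_sustained_elapsed,\
lts__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed \
./my_app
```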

if the L1 cache’s peak throughput were 999999999999 TB/s, its reported utilization would be low. But that alone would not tell me that L1 is less of a bottleneck than L2.

Thank you! But I think this explains why I do not believe that comparing the percentage of peak for each cache level can determine which cache is more of a bottleneck…

Someone suggested that I compare the hit rates of the different cache levels, which sounds reasonable, but in a GEMM we do not really use L1, we use shared memory…

So how can we determine which cache is more bound?

If you understand “bound by” as “the limiting factor or bottleneck is”, then the percentage of peak throughput (or the percentage of the peak number of transactions) is exactly what you want.

The hit rate is important, but not for determining whether the program is bound by a cache level. E.g. consider a program which loads just two values from memory, but does lots of computations. If the two values are at the same address, you have a 50% hit rate, otherwise 0%. Vastly different hit rates. Does it matter? No, as 1 or 2 transactions are far below the peak throughput. (You can also construct a similar program with nearly 100% hit rate, e.g. by just loading 20 values from the same address. Same conclusion.)
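
If you do want to look at hit rates and utilization side by side, you can query both kinds of metrics in one go. A sketch, assuming recent ncu metric names (verify with ncu --query-metrics; kernel regex and application are placeholders):

```
# Hit rates alone don't tell you where the bottleneck is;
# compare them against the percent-of-peak throughput numbers.
ncu -k regex:gemm --metrics \
l1tex__t_sector_hit_rate.pct,\
lts__t_sector_hit_rate.pct,\
l1tex__throughput.avg.pct_of_peak_sustained_elapsed,\
lts__throughput.avg.pct_of_peak_sustained_elapsed \
./my_app
```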

Actually, I used cutlass’ threadblock swizzle, but found little benefit. And I tried to prove that L1 is more of a bottleneck than L2, so that I could optimize L1 and the benefit would be larger!

I compared several cases:

| Case | L1 throughput % | L2 throughput % | DRAM throughput % |
|---|---|---|---|
| a40-lamma3-8B-b124 | 36% | 68% | 59% |
| a100-lamma3-70B-b32 | 49% | 56% | 34% |
| 4090-lamma3-70B-b32 | 18% | 37% | 12% |
| a40-attention-hid128-seq4096-b32-nhead36-bmm | 61% | 73% | 63% |
| 4090-attention-hid512 | 18% | 37% | 32% |

So does this mean that if I optimize L1 (shared memory) I can get more benefit than from optimizing L2?

By the way, I got these throughput percentages from the NCU Speed of Light report.

You could also have a look into Warp State Statistics and compare the values for the long and short scoreboard stalls, which typically give you the amount of waiting for global or shared memory.
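
A sketch for collecting that section from the command line (section identifier as listed by ncu --list-sections on recent versions; kernel regex and application are placeholders):

```
# Warp State Statistics: average stall reasons per issued instruction
ncu --section WarpStateStats -k regex:gemm ./my_app
```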


I find it difficult to make any kind of generalized statements about the performance characteristics of “GEMM” because there are so many different flavors based on matrix size, matrix aspect ratios, transpose modes, element types, GPU architectures. Even early versions of CUBLAS already provided dozens of different kernels for “GEMM”, not sure where that number is today.

Presumably you are interested in a specific subset of GEMM variants, and it might be helpful to specify what exactly that is.

That seems like a reasonable approach to me.

This sounds very interesting! But which is for global and which is for shared?

Thanks!

| Case | L1 throughput % | L2 throughput % | DRAM throughput % |
|---|---|---|---|
| a40-lamma3-8B-b124 | 36% | 68% | 59% |
| a100-lamma3-70B-b32 | 49% | 56% | 34% |
| 4090-lamma3-70B-b32 | 18% | 37% | 12% |
| a40-attention-hid128-seq4096-b32-nhead36-bmm | 61% | 73% | 63% |
| 4090-attention-hid512 | 18% | 37% | 32% |

So according to this, L2 is the bottleneck here?

I am using cutlass with a threadblock tile of 128x128x32, a warp tile of 64x64x32, half precision, m16n8k16, tensor cores.

But when I used the cutlass threadblock swizzle, only the A40 showed about a 1.2~1.3x speedup; when M is small (4096 or less), it is only about a 1.05x speedup. That seems too small…

Is that referring to llama3?

I would say based on your data that:

  1. L2 is more of a bottleneck than L1
  2. L2 is more of a bottleneck than the DRAM bus/throughput

Which is all I really meant to communicate in my original responses. L1 and L2 can be compared for their relative effect on code performance (limitation) by comparing the numbers I referred to (which you now seem to be doing). And that seems to be one of the questions you are asking:

That doesn’t answer the question of whether L2 is the bottleneck. For that, you would have to compare against all other relevant pathways. Again, ncu tries to help you with this, for example in the compute breakdown and memory breakdown in the SOL report section (as well as various other data presented by nsight compute, like the memory hotspot chart I referred to). For example, a well written GEMM could be compute bound (and I already pointed out an example in one of the blogs) rather than memory bound in any way. That is to say, the relevant compute pathway utilization as a percentage of peak is higher than any relevant memory pathway utilization. Which might be the case for some of your test cases.

Stall Math Pipeline seems to be the current massive bottleneck.
You can confirm by looking at Compute Workload Analysis to see the utilization of the individual math pipelines. I would expect the relevant one(s) to be more or less at 100%.
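
A sketch for pulling that section from the command line (section identifier as listed by ncu --list-sections on recent versions; kernel regex and application are placeholders):

```
# Compute Workload Analysis: utilization of the individual math pipelines
# (for a tensor-core GEMM, expect the tensor pipe to be close to 100%)
ncu --section ComputeWorkloadAnalysis -k regex:gemm ./my_app
```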

Long scoreboard is for global memory. Shared memory is short scoreboard (and possibly MIO), but all values below 1.0 can normally be ignored.
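
If you want those two stall reasons as raw numbers, they correspond to per-issue stall metrics in the Warp State Statistics section. A sketch, assuming recent ncu metric names (verify with ncu --query-metrics; placeholders as before):

```
# Average warps stalled on long scoreboard (global/local memory dependencies)
# vs. short scoreboard (shared memory / MIO dependencies), per issued instruction
ncu -k regex:gemm --metrics \
smsp__average_warps_issue_stalled_long_scoreboard_per_issue_active.ratio,\
smsp__average_warps_issue_stalled_short_scoreboard_per_issue_active.ratio \
./my_app
```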