Understanding the different points on the roofline chart

0321-fp16.zip (689.8 KB)
This is my report for a simple test that just uses fp16 in torch to run a bmm GEMM…

So my questions are about the roofline chart:

  1. I’ve noticed that in the roofline model, the points corresponding to L1, L2, and L3 caches have the same y-coordinate values. I assume that the total computational volume is identical across L1, L2, and L3 since computation is not directly related to the cache hierarchy levels. However, the memory traffic passing through each cache level varies, which results in different FLOPs/byte ratios. Generally, the L1 point is the farthest to the right, indicating that L1 has the least memory traffic passing through. Why is this the case?
  2. Moreover, it is often observed that L1 is compute-bound while L2 and DRAM are memory-bound. How should one optimize in such scenarios?
  3. In the roofline model’s “speed of light,” does the compute (SM) throughput refer only to the FP32 cores, or does it include tensor cores as well? Logically, it seems it should encompass all cores, correct?
  4. When determining whether a system is compute-bound or memory-bound, should this assessment be based specifically on the performance of FP32 cores or tensor cores, or is it necessary to consider both for an accurate evaluation?
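
To make questions 1 and 2 concrete, here is a minimal Python sketch of how I understand the per-level points to be placed. The FLOP count is the same for every level, only the bytes moved differ, and each level is compared against its own bandwidth roof. All numbers (matrix shape, peak FLOP/s, bytes moved, bandwidths) are hypothetical placeholders chosen for illustration, not values taken from the attached report, and the names `bmm_flops` and `roofline_point` are my own helpers, not part of any profiler API.

```python
def bmm_flops(batch, m, n, k):
    """FLOPs of a batched matmul: one multiply and one add per inner-product term."""
    return 2 * batch * m * n * k


def roofline_point(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Return (arithmetic intensity, ridge point, classification) for one memory level."""
    intensity = flops / bytes_moved        # FLOP per byte: x-coordinate of the point
    ridge = peak_flops / peak_bandwidth    # intensity where the two roofs meet
    bound = "compute-bound" if intensity >= ridge else "memory-bound"
    return intensity, ridge, bound


# Hypothetical peak compute throughput in FLOP/s (assumption, not a real GPU spec).
PEAK_FLOPS = 100e12

# Hypothetical (bytes moved, peak bandwidth in B/s) per level (assumptions).
LEVELS = {
    "L1": (2e8, 10e12),
    "L2": (1e9, 3e12),
    "DRAM": (5e8, 1e12),
}

# Same computational volume for every level: it does not depend on the cache hierarchy.
flops = bmm_flops(batch=8, m=1024, n=1024, k=1024)

results = {}
for name, (bytes_moved, bandwidth) in LEVELS.items():
    results[name] = roofline_point(flops, bytes_moved, PEAK_FLOPS, bandwidth)
    intensity, ridge, bound = results[name]
    print(f"{name}: intensity={intensity:.1f} FLOP/B, ridge={ridge:.1f} FLOP/B, {bound}")
```

With these made-up numbers, L1 lands to the right of its ridge point while L2 and DRAM land to the left of theirs, which is the situation described in question 2. Because the achieved FLOP/s is shared, all three points also get the same y-coordinate.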