Meaning of L2 --> L2 copy


Every block processes its own portion of global data, and there are no repeated loads.
But there is still 16.93M of L2 to L2 copy. What’s the reason for that?

You can refer to the A100 L2 Cache section in the A100 whitepaper for some more info on the L2 cache for this GPU:

The A100 L2 cache is a shared resource for the GPCs and SMs and lies outside of the GPCs.
The L2 cache is divided into two partitions to enable higher bandwidth and lower latency
memory access. Each L2 partition localizes and caches data for memory accesses from SMs in
the GPCs directly connected to the partition.

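If transfers between the two partitions are needed, it means the data was accessed from an SM that isn’t local to the cache partition the data resides in. This can happen even when every block loads its data only once, because what matters is which partition holds the data, not how often it is loaded.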
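
To make that concrete, here is a toy Python model (not a description of the real hardware). It assumes blocks are scheduled round-robin over 108 SMs, that SMs alternate between the two partitions, and that cache lines alternate between the two partitions by index. All three mappings are hypothetical and chosen only for illustration; the actual address-to-partition mapping on the A100 is not described in the quoted text. The function name `count_cross_partition_accesses` is likewise made up for this sketch.

```python
def count_cross_partition_accesses(num_blocks, lines_per_block,
                                   num_sms=108, num_partitions=2):
    """Toy model: count line accesses that are local vs. remote to the SM's L2 partition.

    Every block touches its own cache lines exactly once (no repeated loads).
    """
    local = 0
    remote = 0
    for block in range(num_blocks):
        sm = block % num_sms                # ASSUMPTION: round-robin block scheduling
        sm_partition = sm % num_partitions  # ASSUMPTION: SM-to-partition mapping
        for i in range(lines_per_block):
            line = block * lines_per_block + i  # unique line per (block, i)
            home = line % num_partitions        # ASSUMPTION: address-to-partition mapping
            if home == sm_partition:
                local += 1
            else:
                remote += 1
    return local, remote


if __name__ == "__main__":
    local, remote = count_cross_partition_accesses(num_blocks=4, lines_per_block=8)
    print(f"local={local} remote={remote}")
```

In this model, half of the accesses land in the remote partition even though no line is loaded twice, because a block has no control over which partition its data lives in. The exact fraction on real hardware depends on mappings this sketch does not know.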