How to utilize L2 partition?

As shown here:

Ampere’s L2 cache is partitioned into two parts. If we could keep data in the L2 partition nearest to the SMs that use it, we could save the time spent moving data. But how can we do that? Is there an example? Thank you!

There are some details in the Programming Guide here.
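
For reference, here is a minimal sketch of the access policy window API that the L2 Access Management section describes. It only shows how to ask for a buffer to be kept resident in the set-aside part of L2; it does not select an L2 partition. The helper names `clampWindowBytes` and `setPersistingWindow` are made up for this illustration, and the hit-ratio choice is an assumption, not a recommendation.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>

// Hypothetical helper: clamp a requested window size to a device limit.
inline size_t clampWindowBytes(size_t requested, size_t maxWindow) {
    return std::min(requested, maxWindow);
}

// Hypothetical helper: ask that [ptr, ptr + bytes) be treated as persisting
// in L2 for kernels subsequently launched on `stream`.
inline cudaError_t setPersistingWindow(cudaStream_t stream, void* ptr, size_t bytes) {
    if (bytes == 0) return cudaErrorInvalidValue;

    int device = 0;
    cudaError_t err = cudaGetDevice(&device);
    if (err != cudaSuccess) return err;

    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) return err;

    // Devices without L2 persistence support report 0 here.
    if (prop.persistingL2CacheMaxSize <= 0) return cudaErrorNotSupported;

    // Set aside part of L2 for persisting accesses, up to the device maximum.
    size_t setAside = clampWindowBytes(bytes, (size_t)prop.persistingL2CacheMaxSize);
    err = cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, setAside);
    if (err != cudaSuccess) return err;

    size_t windowBytes = clampWindowBytes(bytes, (size_t)prop.accessPolicyMaxWindowSize);

    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = ptr;
    attr.accessPolicyWindow.num_bytes = windowBytes;
    // Assumption: scale the hit ratio so the persisting lines fit the set-aside region.
    attr.accessPolicyWindow.hitRatio  =
        std::min(1.0f, (float)setAside / (float)windowBytes);
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;

    return cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```

A caller would typically invoke `setPersistingWindow(stream, d_buf, nbytes)` before launching the kernels that reuse `d_buf`, and call `cudaCtxResetPersistingL2Cache()` once the data no longer needs to stay resident.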

Thanks! I see how to use L2, but not how to find the “nearer corresponding L2”…

I see now, in the “Dissecting…” reference above, that I misunderstood the information you were looking for.

I can’t help with your query, but I do see that the link I gave above to the Programming Guide section on L2 Access Management is now broken due to the new documentation layout.

As I can no longer edit the previous post, the current location for this is here.

I think you need to be aware of how the algorithm is implemented: work out the mapping between thread index and the load/store memory address pattern, then generate a lookup table that maps the virtual thread index to the CUDA-level thread index.
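
A rough sketch of the lookup-table idea, at block granularity, might look like the following. Everything here is illustrative: `makeEvenOddTable` and `scaleRemapped` are made-up names, and the even/odd ordering is only a placeholder permutation. As far as I know the address-to-partition mapping is not publicly documented, so a table that actually improves locality would have to be derived from your own measurements.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Placeholder table (assumption): hardware block b works on virtual block lut[b].
// This one lists the even virtual blocks first, then the odd ones.
inline std::vector<int> makeEvenOddTable(int numBlocks) {
    std::vector<int> lut;
    lut.reserve(numBlocks > 0 ? numBlocks : 0);
    for (int b = 0; b < numBlocks; b += 2) lut.push_back(b);
    for (int b = 1; b < numBlocks; b += 2) lut.push_back(b);
    return lut;
}

// Each hardware block looks up which virtual block (and hence which slice
// of `data`) it should process, instead of using blockIdx.x directly.
__global__ void scaleRemapped(const int* lut, float* data, int n, float factor) {
    int virtualBlock = lut[blockIdx.x];
    int i = virtualBlock * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}
```

The table would be built on the host, copied to the device with `cudaMemcpy`, and passed to a launch with one hardware block per table entry. Because the table is a permutation, every element is still processed exactly once; only the assignment of work to blocks changes.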