How to optimize "L2 Load Access Pattern"

L2 Load Access Pattern
Est. Speedup: 33.47%
The memory access pattern for loads from L1TEX to L2 is not optimal. The granularity of an L1TEX request to L2 is a 128 byte cache line. That is 4 consecutive 32-byte sectors per L2 request. However, this kernel only accesses an average of 2.0 sectors out of the possible 4 sectors per cache line. Check the ►Source Counters section for uncoalesced loads and try to minimize how many cache lines need to be accessed per memory request.
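To make the "2.0 out of 4 sectors" figure concrete, here is a small host-side C++ model. It is only arithmetic, not a profiler API. It counts how many 32-byte sectors one warp touches in each 128-byte cache line for a given element stride. The helper name `sectorsPerLine`, the 32-thread warp, and the 128-byte-aligned base address are assumptions made for illustration. Real L1TEX behavior also depends on caching and alignment, so treat this as a rough sketch.

```cpp
#include <cstddef>
#include <map>
#include <set>

// Hypothetical helper (not part of Nsight Compute or CUDA).
// Models one warp of 32 threads; thread `tid` loads `elemBytes` bytes
// starting at byte offset tid * strideElems * elemBytes from a
// 128-byte-aligned base. Returns the average number of distinct
// 32-byte sectors touched per 128-byte cache line.
double sectorsPerLine(std::size_t strideElems, std::size_t elemBytes) {
    constexpr std::size_t kWarpSize = 32;
    constexpr std::size_t kSectorBytes = 32;
    constexpr std::size_t kSectorsPerLine = 4;  // 128-byte line / 32-byte sector

    // Maps a cache line index to the set of sector indices touched in it.
    std::map<std::size_t, std::set<std::size_t>> sectorsByLine;
    for (std::size_t tid = 0; tid < kWarpSize; ++tid) {
        const std::size_t first = tid * strideElems * elemBytes;
        const std::size_t last = first + elemBytes - 1;
        for (std::size_t s = first / kSectorBytes; s <= last / kSectorBytes; ++s) {
            sectorsByLine[s / kSectorsPerLine].insert(s);
        }
    }

    std::size_t totalSectors = 0;
    for (const auto& entry : sectorsByLine) {
        totalSectors += entry.second.size();
    }
    return static_cast<double>(totalSectors) / sectorsByLine.size();
}
```

Under these assumptions, `sectorsPerLine(1, 4)` (consecutive floats) gives 4.0, while `sectorsPerLine(16, 4)` (one float every 64 bytes) gives 2.0, the same kind of average the rule above reports.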

I am not sure how to optimize the loads from L1 to L2. I guess this cannot be optimized? Thank you!

I see this in Nsight Compute:

Also, I see this figure, but I do not understand what I can learn from it:

What conclusion can be drawn here?

I find there is too little guidance on Analysis-Driven Optimization. If someone could suggest some books or blogs (I have found three from NVIDIA), it would be highly appreciated!

Please refer to the uncoalescedGlobalAccesses Nsight Compute sample under extras/samples/uncoalescedGlobalAccesses.
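Independent of that sample, a common kind of fix is a layout or indexing change so that neighboring threads read neighboring addresses. The sketch below is a host-side C++ illustration, not code from the sample. It compares the memory footprint of one warp reading a single field from an array of structs with reading the same field from a separate array. The `Particle` struct and all helper names are hypothetical, and it assumes 32 threads and a 128-byte-aligned base address.

```cpp
#include <cstddef>
#include <set>

// Hypothetical 16-byte record. A kernel in which each thread reads only `x`
// steps through memory 16 bytes at a time.
struct Particle { float x, y, z, w; };

// Number of distinct 32-byte sectors and 128-byte cache lines touched by a warp.
struct Footprint {
    std::size_t sectors;
    std::size_t lines;
};

// Each of 32 threads reads one 4-byte float at byte offset tid * strideBytes.
// strideBytes is assumed to be a multiple of 4, so no read straddles a sector.
Footprint warpFootprint(std::size_t strideBytes) {
    std::set<std::size_t> sectors;
    std::set<std::size_t> lines;
    for (std::size_t tid = 0; tid < 32; ++tid) {
        const std::size_t addr = tid * strideBytes;
        sectors.insert(addr / 32);
        lines.insert(addr / 128);
    }
    return {sectors.size(), lines.size()};
}

// Array of structs: thread i reads particles[i].x, so the stride is sizeof(Particle).
Footprint aosFootprint() { return warpFootprint(sizeof(Particle)); }

// Struct of arrays: thread i reads xs[i], so the stride is sizeof(float).
Footprint soaFootprint() { return warpFootprint(sizeof(float)); }
```

In this model the array-of-structs read touches 16 sectors across 4 cache lines, while the separate-array read touches 4 sectors in 1 cache line for the same 128 bytes of useful data. That is the sense in which the rule asks you to minimize how many cache lines are accessed per memory request.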

You can also refer to the GPU Technology Conference 2021 talk S32089: Requests, Wavefronts, Sectors Metrics: Understanding and Optimizing Memory-Bound Kernels with Nsight