Uncoalesced Local Accesses

Hello! I wrote a simple CUDA program to test the coalescing patterns of local memory accesses. Here’s the kernel code:

[Image: Screenshot from 2024-04-02 14-49-27 (kernel code)]

I ran the kernel on a V100 GPU and profiled it with Nsight Compute. When I launch 32 threads, it reports “only 1.0 of the 32 bytes transmitted per sector are utilized by each thread” for both local loads and local stores. Here’s the screenshot:

I’m wondering whether I’m misunderstanding the metric. Even if the local accesses are not coalesced, a single 4-byte int should not be spread across different cache sectors, so I would expect at least 4 of the 32 bytes in a sector to be used. Thank you very much for the help!