However, in my kernel implementation the thread block shape is 32 * 32. All threads in a warp access the same address in shared memory, and the addresses accessed by different warps are neighboring.
I think this access pattern is a broadcast, so I don’t understand why this kernel is reported as having uncoalesced shared accesses.
If you check the L1 Wavefronts Shared Excessive table specified by the rule, do you find that these specific loads all access the same address?
Shared memory access patterns are only relevant within each warp instruction. The accesses between warps are not material.
The counter listed as showing shared memory bank conflicts can increment for reasons other than a shared memory bank conflict. The recommendation is therefore to follow the rule and check the L1 Wavefronts Shared Excessive column in the Source View (NOTE: the name was slightly different in older versions of Nsight Compute). The value in this column is calculated only from the memory addresses passed to the instruction.
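To make the per-warp point concrete, here is a rough host-side sketch of how a bank-conflict count can be derived from nothing but the 32 addresses of one warp instruction. It assumes the commonly documented layout of 32 banks, each 4 bytes wide, and 4-byte loads. It is a simplified model for illustration, not the calculation Nsight Compute actually performs, and all names in it (`modelWavefronts`, `broadcastPattern`, `stridedPattern`) are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

constexpr int kWarpSize = 32;        // threads per warp
constexpr int kNumBanks = 32;        // assumed number of shared memory banks
constexpr int kBankWidthBytes = 4;   // assumed bank width

// Simplified model: the number of passes needed for one warp-wide 4-byte
// shared load is the largest number of DISTINCT words that map to a single
// bank. Threads reading the same word are served together (broadcast).
int modelWavefronts(const std::vector<std::uint32_t>& byteAddr) {
    assert(byteAddr.size() == static_cast<std::size_t>(kWarpSize));
    std::vector<std::set<std::uint32_t>> wordsPerBank(kNumBanks);
    for (std::uint32_t addr : byteAddr) {
        std::uint32_t word = addr / kBankWidthBytes;
        wordsPerBank[word % kNumBanks].insert(word);
    }
    std::size_t worst = 0;
    for (const auto& words : wordsPerBank) {
        worst = std::max(worst, words.size());
    }
    return static_cast<int>(worst);
}

// Pattern described in the question: in a 32 * 32 block, every lane of a
// warp reads the same word, and neighboring warps read neighboring words.
std::vector<std::uint32_t> broadcastPattern(int warpId) {
    return std::vector<std::uint32_t>(
        kWarpSize, static_cast<std::uint32_t>(warpId) * kBankWidthBytes);
}

// Lane i reads word i * strideWords, for comparison with the broadcast case.
std::vector<std::uint32_t> stridedPattern(int strideWords) {
    std::vector<std::uint32_t> addrs(kWarpSize);
    for (int lane = 0; lane < kWarpSize; ++lane) {
        addrs[lane] = static_cast<std::uint32_t>(lane * strideWords) * kBankWidthBytes;
    }
    return addrs;
}
```

Under this model, `modelWavefronts(broadcastPattern(w))` is 1 for every warp `w`, regardless of what the other warps access, while `modelWavefronts(stridedPattern(32))` is 32 because all lanes hit different words in the same bank. If the real loads behave like the broadcast case, the excess reported by the counter likely comes from one of the other causes mentioned above.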