Hi all, I am working on optimizing kernel performance with Nsight Compute. I have run into a problem and am looking for advice.
I used Nsight Compute to profile the kernel's performance, and the report warns that it has Uncoalesced Shared Accesses.
However, in the kernel implementation the thread block shape is 32 * 32. All threads in each warp access the same address in shared memory, and the addresses accessed by different warps are neighboring.
I think this access pattern is a broadcast, so I don’t understand why this kernel has Uncoalesced Shared Accesses.
Any advice would be appreciated!
If you don’t wish to provide a code example, my suggestion would be to ask this question on the Nsight Compute forum.
If you check the L1 Wavefronts Shared Excessive table specified by the rule, do you find that these specific loads all access the same address?
Shared memory access patterns are only relevant within each warp instruction. How the accesses relate between warps is not material.
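To make the per-warp point concrete, here is a rough host-side sketch in Python (not the original poster's kernel, and not how Nsight Compute computes its metrics). It assumes 32 shared memory banks that are each 4 bytes wide, and it uses the standard CUDA rule that threads in a block are linearized x-first before being split into warps. The helper names warp_of and wavefronts are made up for this illustration, and the model ignores everything except the addresses of a single warp-wide load.

```python
from collections import defaultdict

NUM_BANKS = 32   # assumption: 32 shared memory banks
BANK_WIDTH = 4   # assumption: each bank is 4 bytes wide
WARP_SIZE = 32


def warp_of(tx, ty, block_dim_x=32):
    """Warp index of thread (tx, ty); threads are linearized x-first."""
    return (ty * block_dim_x + tx) // WARP_SIZE


def wavefronts(byte_addresses):
    """Simplified number of passes one warp-wide shared load needs.

    Lanes that read the same 4-byte word are served by one broadcast;
    distinct words that map to the same bank are serialized.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // BANK_WIDTH
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


# In a 32 x 32 block, the 32 threads with threadIdx.y == 5 form warp 5.
same_row = {warp_of(tx, 5) for tx in range(32)}

# Every lane reads word 5: a broadcast.
broadcast = wavefronts([5 * BANK_WIDTH] * WARP_SIZE)
# Lane i reads word i: one word per bank.
unit_stride = wavefronts([lane * BANK_WIDTH for lane in range(WARP_SIZE)])
# Lane i reads word 32 * i: every lane lands in bank 0.
stride_32 = wavefronts([lane * 32 * BANK_WIDTH for lane in range(WARP_SIZE)])

print(same_row, broadcast, unit_stride, stride_32)  # {5} 1 1 32
```

In this model, a load where every lane of a warp reads the same address costs a single pass, no matter which addresses the other warps read. A 32 * 32 block only gives that pattern if the shared memory index depends on threadIdx.y and not on threadIdx.x, so that is worth double-checking in the kernel.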
The counter listed as showing shared memory bank conflicts can increment for reasons other than a shared memory bank conflict. As such, the recommendation is to follow the rule and check in the Source View whether the column L1 Wavefronts Shared Excessive (NOTE: the name was slightly different in older versions of Nsight Compute) shows a nonzero value for the instruction. The value in this column is calculated only from the memory addresses passed to the instruction.