Is there any way to find the location in CUDA code that causes shared memory bank conflicts?

Hello,

I have written a CUDA kernel with a carefully arranged shared memory layout, and it should be free of bank conflicts.

If I launch the kernel with only one block (that is, with gridDim.x, gridDim.y, and gridDim.z all set to 1), and the block contains 2 warps (64 threads), Nsight Compute reports NO shared load/store bank conflicts.

But things change when the grid gets larger (e.g. gridDim.x = 3, gridDim.y = 4, gridDim.z = 256): Nsight Compute then shows thousands of shared load/store bank conflicts.

So, how can I locate the code responsible? Do bank conflicts occur when different warp schedulers access shared memory at the same time? Is there any way to avoid them?

Robert’s reply here may help in understanding the situation:

In general shared memory bank conflicts can occur any time two different threads are attempting to access (from the same kernel instruction) locations within shared memory for which the lower 4 (pre-cc2.0 devices) or 5 bits (cc2.0 and newer devices) of the address are the same.
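To make the bank rule concrete, here is a small host-side C++ sketch (it runs on the CPU, not the GPU) that models the mapping. It is only an illustration under stated assumptions: 32 banks, each 4 bytes wide (the default mode on cc2.0 and newer devices), "address" read as a byte offset into shared memory, and threads that touch the same 4-byte word counted as a broadcast rather than a conflict. The names `bank_of`, `conflict_degree`, and `column_access` are made up for this example and are not part of any CUDA API. The model looks at one warp's request at a time, which is my understanding of how the conflict degree is defined.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

constexpr int kNumBanks = 32;       // assumption: cc2.0+ device
constexpr int kBankWidthBytes = 4;  // assumption: default 4-byte bank mode
constexpr int kWarpSize = 32;

// Bank holding the 4-byte word that contains the given shared-memory byte offset.
int bank_of(std::uint32_t byte_offset) {
    return static_cast<int>((byte_offset / kBankWidthBytes) % kNumBanks);
}

// Largest number of distinct 4-byte words that one warp-wide request
// addresses in a single bank. 1 means conflict free; n means an n-way conflict.
// Lanes hitting the same word share an entry in the set (broadcast).
int conflict_degree(const std::vector<std::uint32_t>& byte_offsets) {
    std::array<std::set<std::uint32_t>, kNumBanks> words_per_bank;
    for (std::uint32_t off : byte_offsets) {
        words_per_bank[bank_of(off)].insert(off / kBankWidthBytes);
    }
    std::size_t worst = 0;
    for (const auto& words : words_per_bank) {
        worst = std::max(worst, words.size());
    }
    return static_cast<int>(worst);
}

// Byte offsets for a column read of a row-major float tile:
// lane i reads tile[i][0], with rows row_pitch_in_floats elements apart.
std::vector<std::uint32_t> column_access(int row_pitch_in_floats) {
    assert(row_pitch_in_floats > 0);
    std::vector<std::uint32_t> offsets;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        offsets.push_back(static_cast<std::uint32_t>(
            lane * row_pitch_in_floats * sizeof(float)));
    }
    return offsets;
}
```

With this model, `conflict_degree(column_access(32))` is 32 (every lane lands in bank 0 on a different word), while `conflict_degree(column_access(33))` is 1, which is the usual reason for padding a 32-wide tile to a pitch of 33 floats.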

I am wondering how to avoid bank conflicts between two threads of different warps running on different processing blocks of an SM.

BTW, a short scoreboard stall will (also) occur and be reported by Nsight Compute for a bank conflict of this kind, am I right?