I have a CUDA program whose data is stored in shared memory as a two-dimensional array, float[8*8][17]. The data is read from shared memory at different intervals, processed, and then stored back to shared memory. The reported stats are as follows:
smem load transactions/request: 1.88
smem store transactions/request: 1.72
bank conflict per request: 0.09
replay overhead: 4.51%
smem achieved bandwidth: 290GB/s
I don’t think there should be any bank conflicts, though. The “useful” data is actually a 3D volume of 8x8x16, which I padded to 8x8x17 to avoid bank conflicts (it is stored as a two-dimensional array of 64x17). There are 64 threads in the block, and they are basically all reading a 4x16 region (4 in the horizontal direction and 16 in the depth direction) out of the 3D cube at the same time. Within this 4x16 region, neighbouring items in the depth direction differ in address by 1 (counted in 4-byte words, i.e. 4 bytes), and neighbouring items in the horizontal direction differ by 17*8. In other words, the addresses in this 4x16 region look like this:
s+0 17*8+s 17*8*2+s 17*8*3+s
s+1 17*8+s+1 17*8*2+s+1 17*8*3+s+1
…
s+15 17*8+s+15 17*8*2+s+15 17*8*3+s+15
where s can be any number from 0 to 7. I don’t see how the address difference could be a multiple of 32. Am I missing something?
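
The address arithmetic above can be checked on the host with a small sketch like the one below. It is only an illustration under stated assumptions: 32 banks of 4-byte words, and a guessed thread-to-element mapping in which each of the two warps covers two adjacent horizontal positions times all 16 depth positions. The post does not state the real mapping, and the names `kBanks`, `kHorizStride` and `worst_conflict_degree` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Assumption: 32 shared-memory banks, each one 4-byte word wide.
constexpr int kBanks = 32;
// Word stride between horizontal neighbours, taken from the post (17*8).
constexpr int kHorizStride = 17 * 8;
// Depth neighbours are 1 word apart, 16 of them per horizontal position.
constexpr int kDepth = 16;

// Returns the largest number of different words that fall into a single
// bank for one warp. 1 means conflict-free, 2 means a 2-way conflict, etc.
// Assumed mapping (not from the post): warp w handles horizontal
// positions 2*w and 2*w+1, i.e. 2 * 16 = 32 threads.
int worst_conflict_degree(int s, int warp) {
    assert(s >= 0 && s < 8);
    assert(warp == 0 || warp == 1);

    int words_in_bank[kBanks] = {0};
    for (int h = 2 * warp; h < 2 * warp + 2; ++h) {
        for (int d = 0; d < kDepth; ++d) {
            // Word address of element (h, d), as in the table above.
            int addr = s + d + h * kHorizStride;
            // All 32 addresses are distinct, so a plain count per bank
            // equals the number of different words in that bank.
            ++words_in_bank[addr % kBanks];
        }
    }
    return *std::max_element(words_in_bank, words_in_bank + kBanks);
}
```

Calling `worst_conflict_degree(s, w)` for every s from 0 to 7 and w in {0, 1} gives the conflict degree per warp under this mapping. A different assignment of threads to elements would need the two loops changed to match.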