Unexpected shared memory bank conflict.

I observed a shared memory bank conflict in the following case:

Using 128-bit shared loads (LDS.U.128) on an RTX GPU.
Byte offset = (tid % 4) * 16

That is:
tid, byte offset
0, 0
1, 16
2, 32
3, 48
4, 0
5, 16
6, 32
7, 48
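
Something like the following minimal kernel reproduces this pattern (just a sketch, not my actual code; the kernel name and shared array are made up):

    __global__ void lds128_pattern(float4 *out)
    {
        __shared__ float4 smem[16];        // 16 x 16 bytes = 256 bytes

        int tid = threadIdx.x;
        if (tid < 16) {
            smem[tid] = make_float4(tid, tid, tid, tid);
        }
        __syncthreads();

        // Byte offset = (tid % 4) * 16  ->  float4 index = tid % 4
        float4 v = smem[tid % 4];          // one 128-bit shared load (LDS.128) per thread

        out[tid] = v;                      // keep the load from being optimized away
    }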

I assumed there would be no bank conflict in this case, because the data at byte offset 0 would be broadcast to threads 0, 4, 8, 12, 16, 20, 24, and 28 (and likewise for the other offsets).
But the profiling results show that 4 requests are generated for each LDS.U.128 instruction. (I expected 2 requests.)

How should I explain this?

Thanks!

On CC 7.0 - 7.5 devices, shared memory loads with uniform addresses can achieve increased bandwidth if one of the following is true:

  • Thread pairs (Tn and Tn^1) have the same addresses for all active threads (i.e. T0==T1, T2==T3, T4==T5, T6==T7, etc.), or
  • Thread pairs (Tn and Tn^2) have the same addresses for all active threads (i.e. T0==T2, T1==T3, T4==T6, T5==T7, etc.)
  • (Note these encompass the case where all active threads have the same address)

The return bandwidth of L1 and shared memory is 128 bytes/cycle, i.e. one 32-bit register per thread per cycle for a full warp. Uniform accesses can achieve 2x that throughput by packing the return registers. In the case above neither pairing rule is met, so LDS.128 requires 4 requests per instruction, since it returns 4 registers per thread. If the guidelines are met, the requests per instruction should drop from 4 to 2.
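
For illustration, here is a small host-side check (assumed code, not from the original answer or the profiler) that tests the two pairing conditions against the pattern in the question, offset = (tid % 4) * 16; it reports that neither condition holds, which is why the instruction stays at 4 requests:

    #include <stdio.h>

    int main(void)
    {
        int pair1 = 1, pair2 = 1;                  // Tn == Tn^1, Tn == Tn^2
        for (int tid = 0; tid < 32; ++tid) {
            int off  = (tid % 4) * 16;
            int off1 = ((tid ^ 1) % 4) * 16;
            int off2 = ((tid ^ 2) % 4) * 16;
            if (off != off1) pair1 = 0;
            if (off != off2) pair2 = 0;
        }
        printf("Tn==Tn^1 holds: %d, Tn==Tn^2 holds: %d\n", pair1, pair2);
        return 0;
    }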

I have changed the access pattern according to the “thread pairing” rule, and the reported bank conflicts are now gone. Thank you!
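
For completeness, a sketch of the kind of change that satisfies the Tn == Tn^1 rule (illustrative only, not my exact kernel): adjacent threads are paired so they read the same 16-byte vector.

    __global__ void lds128_paired(float4 *out)
    {
        __shared__ float4 smem[16];

        int tid = threadIdx.x;
        if (tid < 16) {
            smem[tid] = make_float4(tid, tid, tid, tid);
        }
        __syncthreads();

        // Byte offset = ((tid / 2) % 4) * 16, so T0==T1, T2==T3, T4==T5, ...
        float4 v = smem[(tid >> 1) % 4];   // pairing rule met -> 2 requests per LDS.128

        out[tid] = v;
    }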