I observed a shared memory bank conflict in the following case:
Using a 128-bit shared load on an RTX GPU.
Byte offset = (tid % 4) * 16
That is:
tid, byte offset
0, 0
1, 16
2, 32
3, 48
4, 0
5, 16
6, 32
7, 48
…
I assumed that there would be no bank conflict in this case, because the data at offset 0 would be broadcast to threads 0, 4, 8, 12, 16, 20, 24, 28.
But the profiling result shows that 4 requests are generated for each LDS.U.128 instruction. (I expected 2 requests.)
How should I explain this?
Thanks!