Hello, just checking to make sure that when the addresses are in __shared__, it’s a matter of which banks the threads of a warp are accessing, not whether all of the addresses fit neatly in a contiguous range of 32 elements. Is that correct? If my threads are accessing indices 5, 38, 71, 104, …, 995, 1028 of an array (for simplicity, the array elements are four bytes each), those accesses are going to be handled by banks 5, 6, 7, …, 31, 0, 1, 2, 3, 4. That should be the same performance as a warp accessing elements 5, 6, 7, 8, …, 35, 36 of the same array, no? Of course, when accessing global memory the latter is OK (best would be 0, 1, …, 30, 31), but the former is the absolute worst: every thread pulls in its own cache line.
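To make the pattern concrete, here’s a toy kernel (hypothetical; the array size and names are just for illustration) doing the stride-33 shared read I have in mind, with the bank arithmetic printed on the host side:

```
// Sketch of the stride-33 pattern described above (illustrative only).
// For 4-byte words, bank = (word index) % 32, so indices 5, 38, 71, ...
// land in banks 5, 6, 7, ..., 31, 0, ..., 4 -- all distinct.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void stride33(const float *in, float *out)
{
    __shared__ float tile[1056];              // 32 threads * stride 33 = 1056 words

    // Stage data into shared memory with contiguous (coalesced) global loads.
    for (int i = threadIdx.x; i < 1056; i += blockDim.x)
        tile[i] = in[i];
    __syncthreads();

    // Each thread reads word 5 + 33*lane: 32 requests to 32 distinct banks,
    // so the access is conflict-free even though the addresses are not contiguous.
    int idx = 5 + 33 * threadIdx.x;
    out[threadIdx.x] = tile[idx];
}

int main()
{
    float *in, *out;
    cudaMallocManaged(&in, 1056 * sizeof(float));
    cudaMallocManaged(&out, 32 * sizeof(float));
    for (int i = 0; i < 1056; ++i) in[i] = (float)i;

    stride33<<<1, 32>>>(in, out);
    cudaDeviceSynchronize();

    for (int t = 0; t < 32; ++t)
        printf("thread %2d reads index %4d (bank %2d) -> %g\n",
               t, 5 + 33 * t, (5 + 33 * t) % 32, out[t]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```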
Correct. For maximum throughput to shared memory, the rule, considering a warp-wide access, is that no more than one item per bank should be requested. It is not necessary that all addresses be contiguous. Shared memory also generally has the broadcast rule: if there are multiple requests to the same bank, but they are all to the same location, efficiency is not reduced. A particular location can be broadcast to multiple threads in a warp, per transaction, at no additional cost.
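As a minimal sketch of both points (hypothetical kernel, names made up): a broadcast read, where the same bank and the same location cost nothing extra, next to a 2-way conflict, where the same bank but different locations serialize the request:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void bank_examples(const float *in, float *out)
{
    __shared__ float tile[64];
    for (int i = threadIdx.x; i < 64; i += blockDim.x)
        tile[i] = in[i];
    __syncthreads();

    // Broadcast: all 32 threads read the same word (bank 7).
    // Same bank AND same location -> served in one transaction.
    float a = tile[7];

    // 2-way conflict: even lanes read word 0, odd lanes read word 32.
    // Both words sit in bank 0 but are different locations, so the
    // request is split into two serialized transactions.
    float b = tile[(threadIdx.x & 1) * 32];

    out[threadIdx.x] = a + b;
}

int main()
{
    float *in, *out;
    cudaMallocManaged(&in, 64 * sizeof(float));
    cudaMallocManaged(&out, 32 * sizeof(float));
    for (int i = 0; i < 64; ++i) in[i] = (float)i;

    bank_examples<<<1, 32>>>(in, out);
    cudaDeviceSynchronize();
    printf("out[0] = %g, out[1] = %g\n", out[0], out[1]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```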
Thanks, Robert. I’ve been doing some head math and thinking that I may be piling things up, much more than is needed, on a handful of the __shared__ banks. And there’s another thing I want to do that will be even more powerful if I don’t need to worry about keeping all the accesses adjacent per se.