And towards the end, he has this to say about shared memory implementations:

My question is, why? How do we know that for a 32 x 32 tile that all elements in a column map to the same bank? What determines how shared memory banks are mapped?

And then why does increasing the width by 1 solve it?

I’m not sure if it was explained in the blog or not, but it’s doing a good job of confusing me.

The bank mapping for shared memory (on compute capability 3.0 and up, I believe) is:

(byte address of the access / 4) % 32

The divide by four is because shared memory banks are 32 bits (4 bytes) wide, but addresses are in bytes. The mod 32 is because there are 32 banks.
Now in a 1D array of ints, each element has an address divisible by 4 (since the values are aligned), and each element is at an address 4 higher than the previous one (because the array is contiguous in memory), so a 1D array with 32 values might have addresses like:

112 116 120 124 128 132 136 ... 228 232 236

Then if you look at the banks of those addresses you get

28 29 30 31 0 1 2 ... 25 26 27

Now if we add an extra row to make a 2D array, the first element of row 1 comes right after the last element of row 0, since all the elements are contiguous in memory.

Note that if you had any multiple of 32 columns, the banks would match in every row, so all the elements of a column would sit in the same bank. And if the number of columns evenly divides 32, the banks cycle through a short repeating pattern (with 16 columns, each column alternates between just 2 banks from row to row).

Now if we add one more element to each row, raising the width to 33, each row starts one bank later than the row above it. That means the 32 elements of any column land in 32 different banks, so threads reading down a column no longer conflict. That's why padding the tile width by 1 solves it.