It appears that shared memory is arranged as interleaved 32-bit words, with each successive word belonging to a different bank.
Suppose I define a byte array in shared memory. Does that mean the first four bytes belong to one bank, the next four to another, and so on?
What does that mean for avoiding bank conflicts when, say, copying one shared memory buffer to another in parallel?
So am I correct in concluding that accessing byte arrays in shared memory causes bank conflicts on compute capability 1.x (Tesla), but not on modern GPUs with compute capability 2.x and higher (Fermi, Kepler, Maxwell, and Pascal)?
Yes, for cc2.x and higher, two threads that access bytes within the same 32-bit word will not cause a bank conflict, effectively due to the broadcast mechanism.
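To make the mapping concrete, here is a rough host-side sketch in plain C++ (not device code) that models which bank a byte offset lands in and how many distinct 32-bit words a warp requests from the busiest bank. It assumes 32 banks of 4 bytes each and treats any number of accesses to the same word as conflict-free, per the cc2.x+ behaviour described above. The names `bank_of_byte`, `conflict_degree` and `warp_offsets` are made up for illustration, and the model ignores things like the optional 8-byte bank mode on Kepler.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Bank holding a given byte offset into shared memory, assuming
// 4-byte-wide banks that are interleaved word by word.
inline std::size_t bank_of_byte(std::size_t byte_offset,
                                std::size_t num_banks = 32) {
    assert(num_banks > 0);
    return (byte_offset / 4) % num_banks;
}

// Largest number of distinct 32-bit words requested from any single bank.
// Under this simplified model, 1 means the request is conflict-free,
// because accesses that fall in the same word are served together.
std::size_t conflict_degree(const std::vector<std::size_t>& byte_offsets,
                            std::size_t num_banks = 32) {
    assert(num_banks > 0);
    std::vector<std::set<std::size_t>> words_per_bank(num_banks);
    for (std::size_t offset : byte_offsets) {
        std::size_t word = offset / 4;
        words_per_bank[word % num_banks].insert(word);
    }
    std::size_t worst = 0;
    for (const auto& words : words_per_bank) {
        if (words.size() > worst) worst = words.size();
    }
    return worst;
}

// Byte offsets touched by one warp when thread t accesses buf[t * stride]
// in a byte array that starts at offset 0.
std::vector<std::size_t> warp_offsets(std::size_t stride,
                                      std::size_t warp_size = 32) {
    std::vector<std::size_t> offsets;
    for (std::size_t t = 0; t < warp_size; ++t) {
        offsets.push_back(t * stride);
    }
    return offsets;
}
```

With these assumptions, `conflict_degree(warp_offsets(1))` is 1, which suggests that a straightforward byte copy where thread t handles element t is conflict-free on cc2.x+. A stride of 8 bytes gives 2 and a stride of 128 bytes gives 32, since those patterns hit different words in the same bank.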