Shared memory bank conflicts with byte arrays


It appears that shared memory is arranged as interleaved 32-bit words, with successive words belonging to different banks.
Suppose I define a byte array in shared memory. Does that mean the first four bytes belong to one bank, the next four to another, and so on?
What does that mean for avoiding bank conflicts when, say, copying one shared memory buffer to another in parallel?
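
The mapping asked about here can be written as a one-line formula. Below is a minimal C++ sketch (host code, not a kernel), assuming 32 banks that are each 4 bytes wide, as on compute capability 2.x and later in the default mode. The names `bank_of_byte`, `kBanks`, and `kBankWidthBytes` are made up for illustration.

```cpp
#include <cstddef>

// Assumed layout: 32 banks, each one 32-bit word wide (cc 2.x and later,
// default mode). cc 1.x devices have 16 banks instead.
constexpr std::size_t kBanks = 32;
constexpr std::size_t kBankWidthBytes = 4;

// Bank that holds the byte at `byte_offset` from the start of shared memory.
constexpr std::size_t bank_of_byte(std::size_t byte_offset) {
    return (byte_offset / kBankWidthBytes) % kBanks;
}
```

Under these assumptions, bytes 0 to 3 map to bank 0, bytes 4 to 7 map to bank 1, and byte 128 wraps around to bank 0 again.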


Each thread should copy four (consecutive, aligned) bytes at once.

On Fermi, there’s no bank conflict on reading byte arrays since it supports multibroadcast shared memory.

But on GT200 and G80, you’ll get conflicts and as Tera says, it’s more efficient to have each thread copy a word.
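
To see why a byte per thread conflicts on these older parts, here is a rough host-side C++ model, not device code. It assumes the cc 1.x layout of 16 banks of 32-bit words with requests issued per half-warp, and it counts every thread that touches a bank, even when several threads want bytes of the same word. The case where all threads read one identical word (broadcast) is deliberately ignored. The function names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Simplified cc 1.x model: 16 banks, 4 bytes wide, one request per half-warp.
// Returns the largest number of threads that land on the same bank.
int conflict_degree_cc1x(const std::vector<std::size_t>& byte_offsets) {
    std::array<int, 16> threads_per_bank{};
    for (std::size_t off : byte_offsets) {
        ++threads_per_bank[(off / 4) % 16];
    }
    return *std::max_element(threads_per_bank.begin(), threads_per_bank.end());
}

// Byte offsets touched by one half-warp when thread t accesses t * stride_bytes.
std::vector<std::size_t> half_warp_offsets(std::size_t stride_bytes) {
    std::vector<std::size_t> offsets;
    for (std::size_t t = 0; t < 16; ++t) {
        offsets.push_back(t * stride_bytes);
    }
    return offsets;
}
```

In this model, `conflict_degree_cc1x(half_warp_offsets(1))` (one byte per thread) gives 4, while `conflict_degree_cc1x(half_warp_offsets(4))` (one aligned word per thread) gives 1, which matches the advice above.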

But if you’re just COPYING the byte array you should use the one-word-per-thread method anyway just for speed even on Fermi.

Some articles out there say that accessing byte arrays causes bank conflicts. But the CUDA C Programming Guide (v8.0.61), section G.3.3, Shared Memory, says:

This seems to apply to compute capability >= 2.x.

So am I correct in concluding that accessing byte arrays in shared memory does cause bank conflicts on compute capability 1.x (Tesla), and does not cause them on modern GPUs with compute capability >= 2.x (Fermi, Kepler, Maxwell, and Pascal)?

Honestly, who cares about cc 1.x?

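Yes, for cc 2.x and higher, two threads that access bytes within the same 32-bit word will not cause a bank conflict, effectively due to the broadcast mechanism. A conflict only arises when threads access different 32-bit words that map to the same bank.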
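
As a rough illustration of that rule, here is a host-side C++ sketch, not device code. It assumes 32 banks of 32-bit words and full-warp requests, and it ignores the 64-bit bank mode available on cc 3.x. It counts distinct words per bank, so threads sharing a word do not add to the count. The function names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <set>
#include <vector>

// Simplified cc >= 2.x model: threads reading bytes of the same 32-bit word
// are served together, so only distinct words in a bank serialize.
// Returns the largest number of distinct words requested from any one bank.
int conflict_degree_cc2x(const std::vector<std::size_t>& byte_offsets) {
    std::array<std::set<std::size_t>, 32> words_per_bank;
    for (std::size_t off : byte_offsets) {
        const std::size_t word = off / 4;
        words_per_bank[word % 32].insert(word);
    }
    std::size_t degree = 0;
    for (const auto& words : words_per_bank) {
        degree = std::max(degree, words.size());
    }
    return static_cast<int>(degree);
}

// Byte offsets touched by one warp when thread t accesses t * stride_bytes.
std::vector<std::size_t> warp_offsets(std::size_t stride_bytes) {
    std::vector<std::size_t> offsets;
    for (std::size_t t = 0; t < 32; ++t) {
        offsets.push_back(t * stride_bytes);
    }
    return offsets;
}
```

In this model, strides of 1 byte and 4 bytes both give a degree of 1 (no conflict), while a stride of 128 bytes puts 32 different words in bank 0 and gives a degree of 32.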