I am a bit confused by the description of global and shared memory performance. Global memory reads are fastest if they are 16-byte aligned and coalesced into a single contiguous block for the entire warp.
Now assume that each thread of a block loads a 16-byte-aligned structure into shared memory, and that each thread then performs calculations on the structure it has loaded. Access to shared memory will then be highly inefficient, because there are many bank conflicts.
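For example, take a 16-byte structure (e.g. float4). When 16 threads of a warp access shared memory together, each thread's structure starts four 32-bit words after the previous one, so with 16 banks the threads fall into 4 groups of 4 that each hit the same bank!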
To make this fast, would I have to add 4 bytes of padding to each structure in shared memory?
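
To make the arithmetic concrete, here is a small host-side sketch (plain C++, not device code) that counts how many threads land on the same bank for a given stride. It assumes 16 banks of 4 bytes each and 16 threads issuing one request together, with thread t reading the word at index t * stride; the name max_conflict_degree and the constants are made up for this illustration and are not part of any CUDA API.

```cpp
#include <array>
#include <cstddef>

// Assumptions for illustration only: 16 banks, each 4 bytes wide,
// and 16 threads issuing one shared memory request together.
constexpr std::size_t kBanks = 16;
constexpr std::size_t kThreads = 16;

// Returns the largest number of threads that hit a single bank when
// thread t reads the 32-bit word at index t * stride_words.
// (Hypothetical helper, not a CUDA function.)
inline std::size_t max_conflict_degree(std::size_t stride_words) {
    std::array<std::size_t, kBanks> hits{};  // per-bank counters, zero-initialized
    std::size_t worst = 0;
    for (std::size_t t = 0; t < kThreads; ++t) {
        const std::size_t bank = (t * stride_words) % kBanks;
        ++hits[bank];
        if (hits[bank] > worst) {
            worst = hits[bank];
        }
    }
    return worst;
}
```

Under these assumptions, max_conflict_degree(4) (an unpadded float4, 4 words per thread) gives 4, while max_conflict_degree(5) (the same structure padded by one 4-byte word) gives 1, i.e. no conflict. Note that a 5-word stride means the structures in shared memory are no longer 16-byte aligned.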