Alignment requirement


What exactly do we mean by the alignment requirement for coalesced memory access, and how does it affect performance?

Thanks in advance

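Basically, the 16 threads of a half-warp should access consecutive addresses, with thread k accessing the k-th element (elements 4, 8 or 16 bytes wide), so that the I/O is performed in a single operation. The alignment part means that the first of those addresses should fall on a multiple of 16 * sizeof(element), e.g. 64 bytes for 4-byte elements. Otherwise the access may be split into 16 separate I/O operations, i.e. up to 16x slower.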

For example:
float_array[threadIdx.x] = x;

(1-byte and 2-byte wide elements will not coalesce)
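
To see the effect for yourself, here is a minimal sketch that times three write patterns: aligned and consecutive, consecutive but shifted by one element, and strided. Everything named here (the kernels, the run_access_patterns helper, the sizes, the offset and the stride) is made up for illustration and is not from any library. It assumes the buffer comes from cudaMalloc, which returns pointers aligned to at least 256 bytes. How large the gap is depends heavily on the GPU generation; newer hardware is much more forgiving about the shifted case.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Thread i writes element i: consecutive 4-byte words starting at the
// (aligned) base of the buffer.
__global__ void write_aligned(float *out, float x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x;
}

// Same pattern shifted by `offset` elements: the addresses are still
// consecutive, but the first one no longer sits on a segment boundary.
__global__ void write_offset(float *out, float x, int n, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i + offset] = x;
}

// Neighbouring threads write `stride` elements apart, so the addresses
// touched by a half-warp are not consecutive at all.
__global__ void write_strided(float *out, float x, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i * stride] = x;
}

// Hypothetical helper: times the three kernels and prints the results.
// Returns 0 on success, 1 on any CUDA error.
int run_access_patterns()
{
    const int n = 1 << 20;    // elements written by each kernel
    const int threads = 256;  // threads per block
    const int offset = 1;     // shift by one float (4 bytes)
    const int stride = 16;    // 16 floats = 64 bytes between neighbours
    assert(n % threads == 0);
    const int blocks = n / threads;

    // Sized for the strided kernel, which touches the widest range.
    float *d_buf = nullptr;
    size_t bytes = (size_t)n * stride * sizeof(float);
    if (cudaMalloc(&d_buf, bytes) != cudaSuccess) return 1;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time startup cost is not charged to pattern 0.
    write_aligned<<<blocks, threads>>>(d_buf, 0.0f, n);
    cudaDeviceSynchronize();

    float ms[3] = {0.0f, 0.0f, 0.0f};
    for (int k = 0; k < 3; ++k) {
        cudaEventRecord(start);
        if (k == 0) write_aligned<<<blocks, threads>>>(d_buf, 1.0f, n);
        if (k == 1) write_offset<<<blocks, threads>>>(d_buf, 1.0f, n, offset);
        if (k == 2) write_strided<<<blocks, threads>>>(d_buf, 1.0f, n, stride);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms[k], start, stop);
    }

    // Pick up any launch or execution error from the kernels above.
    cudaError_t err = cudaDeviceSynchronize();

    printf("aligned: %.3f ms, offset: %.3f ms, strided: %.3f ms\n",
           ms[0], ms[1], ms[2]);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    return err == cudaSuccess ? 0 : 1;
}
```

Call run_access_patterns() from your own main and build with nvcc. Single-launch timings are noisy, so average over several runs before drawing conclusions.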