coalesced data accesses in global memory

Hi everyone,
I am a beginner in CUDA.
I was reading about coalesced data accesses in global memory in the performance guidelines of the programming guide 2.3.1, for cc 1.1 devices.

It says there:
"Coalesced 8-byte accesses deliver a little lower bandwidth than coalesced 4-byte accesses and
coalesced 16-byte accesses deliver a noticeably lower bandwidth than coalesced 4-byte accesses"

I didn't quite get that.
Can anybody explain it to me with examples?
Thanks in advance.

Look at the NVIDIA CUDA C Programming Best Practices Guide.
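
To make the quoted sentence concrete: the "4-byte", "8-byte" and "16-byte" refer to the size of the word each thread reads or writes, for example float, float2 and float4. On cc 1.x devices a coalesced half-warp (16 threads) is served with one 64-byte transaction for 4-byte words, one 128-byte transaction for 8-byte words, and two 128-byte transactions for 16-byte words. Below is a rough sketch of how you could measure this yourself. The names copy_kernel and copy_bandwidth_gbs are my own, not from any guide, and the 16 MiB buffer size is an arbitrary assumption. The numbers you get depend on the card, and newer GPUs may not show the same ordering as cc 1.1 hardware.

```cuda
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Each thread copies one element of type T, so consecutive threads access
// consecutive sizeof(T)-byte words (4 bytes for float, 8 for float2, 16 for float4).
template <typename T>
__global__ void copy_kernel(const T* in, T* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Hypothetical helper: copies `bytes` bytes (assumed to be a multiple of 16)
// using per-thread words of type T. Returns the measured bandwidth in GB/s
// (bytes read plus bytes written), or -1 if something went wrong.
template <typename T>
float copy_bandwidth_gbs(size_t bytes)
{
    std::vector<unsigned char> h_in(bytes), h_out(bytes);
    for (size_t i = 0; i < bytes; ++i) h_in[i] = (unsigned char)(i % 251);

    void* d_in = 0;
    void* d_out = 0;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1.0f; }
    cudaMemcpy(d_in, &h_in[0], bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, bytes);

    int n = (int)(bytes / sizeof(T));
    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so the timed run does not include one-time startup costs.
    copy_kernel<T><<<blocks, threads>>>((const T*)d_in, (T*)d_out, n);

    cudaEventRecord(start);
    copy_kernel<T><<<blocks, threads>>>((const T*)d_in, (T*)d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Copy the result back and check that the kernel really copied the data.
    cudaMemcpy(&h_out[0], d_out, bytes, cudaMemcpyDeviceToHost);
    bool ok = (cudaGetLastError() == cudaSuccess) &&
              (memcmp(&h_in[0], &h_out[0], bytes) == 0);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);

    if (!ok || ms <= 0.0f) return -1.0f;
    return (2.0f * bytes / 1e9f) / (ms / 1e3f);
}

int main()
{
    const size_t bytes = 16u << 20;  // 16 MiB, an arbitrary test size
    printf("4-byte  words (float):  %.2f GB/s\n", copy_bandwidth_gbs<float>(bytes));
    printf("8-byte  words (float2): %.2f GB/s\n", copy_bandwidth_gbs<float2>(bytes));
    printf("16-byte words (float4): %.2f GB/s\n", copy_bandwidth_gbs<float4>(bytes));
    return 0;
}
```

All three kernels move the same number of bytes and all three are fully coalesced. The only thing that changes is the word size per thread, which is what the sentence in the guide is comparing.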