Hi everyone,
I am a beginner in CUDA.
I was reading about coalesced global memory accesses in the performance guidelines of the programming guide 2.3.1, for devices of compute capability 1.1.
It says there:
"Coalesced 8-byte accesses deliver a little lower bandwidth than coalesced 4-byte accesses and
coalesced 16-byte accesses deliver a noticeably lower bandwidth than coalesced 4-byte accesses"
I didn't quite understand that.
Can anybody explain it to me with examples?
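To make the question concrete, here is a minimal sketch of what I understand the three cases to be. It is only my reading of the guide, not something taken from it: `copyKernel` and `runCopy` are names I made up, and I am assuming that `float`, `float2` and `float4` stand for the 4-byte, 8-byte and 16-byte accesses, with thread i reading element i so that each access is coalesced.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: thread i copies element i, so consecutive threads
// touch consecutive elements. T = float  -> 4-byte access per thread,
// T = float2 -> 8-byte access, T = float4 -> 16-byte access.
template <typename T>
__global__ void copyKernel(const T* in, T* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Hypothetical helper: copies n elements of type T on the device and
// checks on the host that the output matches the input.
template <typename T>
bool runCopy(int n)
{
    const size_t bytes = (size_t)n * sizeof(T);
    const size_t floats = bytes / sizeof(float);
    std::vector<float> hostIn(floats), hostOut(floats, 0.0f);
    for (size_t k = 0; k < floats; ++k) hostIn[k] = (float)k;

    T* devIn = 0;
    T* devOut = 0;
    if (cudaMalloc((void**)&devIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void**)&devOut, bytes) != cudaSuccess) {
        cudaFree(devIn);
        return false;
    }
    cudaMemcpy(devIn, hostIn.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    copyKernel<T><<<blocks, threads>>>(devIn, devOut, n);

    // This copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(hostOut.data(), devOut, bytes,
                                 cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devOut);
    return err == cudaSuccess && hostIn == hostOut;
}

int main()
{
    const int n = 1 << 20;  // number of elements per run (arbitrary choice)
    printf("4-byte  (float):  %s\n", runCopy<float>(n)  ? "ok" : "failed");
    printf("8-byte  (float2): %s\n", runCopy<float2>(n) ? "ok" : "failed");
    printf("16-byte (float4): %s\n", runCopy<float4>(n) ? "ok" : "failed");
    return 0;
}
```

As far as I can tell from the same section of the guide, on compute capability 1.1 a half-warp doing the 4-byte case is served by one 64-byte transaction, the 8-byte case by one 128-byte transaction, and the 16-byte case by two 128-byte transactions. The sketch only checks correctness. To compare bandwidth, I assume one would time each kernel launch (for example with CUDA events) and divide the bytes moved by the elapsed time.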
Thanks in advance.