I have a question aimed at CUDA developers who are much more experienced than I am. A lot has been written about coalesced memory read/write operations. I understand the basics of the concept, and I also understand how to achieve high read/write bandwidth with coalescing. I have carefully read the CUDA Programming Guide and the Optimization Strategies described by Mark Harris (SC 2007), and I have also read many topics on this forum. But … I still don't fully understand the "background" of coalescing. To be more precise:
Devices support 128-bit memory reads with a single read instruction. The 384-bit memory bus of the device should therefore be able to read 3 float4 vectors in one clock cycle. Before I tried coalescing, I believed that using float4 data types (uncoalesced float4 reads/writes) would give maximal memory bandwidth, but it doesn't. So what exactly does coalescing mean? Does it mean that the device reads/writes whole blocks of device memory? If so, what is the reason for this behaviour? Does it mean that the address bus of the device handles only one address per clock cycle, and the whole block at that address has to be read/written? And is there any instruction that permits coalesced reads/writes of more than 128 bits?
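To make the question concrete, here is a minimal sketch of the two kinds of access I am comparing. It is only an illustration, not the code I actually measured: the kernel names `copyFloat` and `copyFloat4`, the host helper `runCopy`, and the block size of 256 are all hypothetical choices made for this post.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Pattern A: each thread copies one 32-bit float; consecutive threads
// touch consecutive addresses.
__global__ void copyFloat(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Pattern B: each thread copies one 128-bit float4 element.
__global__ void copyFloat4(const float4* in, float4* out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) out[i] = in[i];
}

// Hypothetical host helper: copies n floats with one of the two kernels
// and returns true if the output matches the input.
bool runCopy(bool useFloat4, int n)
{
    assert(n % 4 == 0);  // so the buffer can be viewed as float4 elements

    std::vector<float> hIn(n), hOut(n, 0.0f);
    for (int i = 0; i < n; ++i) hIn[i] = static_cast<float>(i);

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* dIn = nullptr;
    float* dOut = nullptr;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }
    cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, bytes);

    const int threads = 256;  // assumed block size, not tuned
    if (useFloat4) {
        const int n4 = n / 4;
        copyFloat4<<<(n4 + threads - 1) / threads, threads>>>(
            reinterpret_cast<const float4*>(dIn),
            reinterpret_cast<float4*>(dOut), n4);
    } else {
        copyFloat<<<(n + threads - 1) / threads, threads>>>(dIn, dOut, n);
    }

    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    for (int i = 0; i < n && ok; ++i) ok = (hOut[i] == hIn[i]);
    return ok;
}
```

Calling `runCopy(false, 1 << 20)` and `runCopy(true, 1 << 20)` exercises both patterns on the same data. The sketch only checks correctness. It does not time anything, so it says nothing by itself about which pattern reaches higher bandwidth.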
I would be grateful for any explanation, or for links to technical details about CUDA devices that I should read. I'm using CUDA in my thesis work and I really need to understand the concept of coalesced memory access. Thank you very much…