It's not too hard to max out the bandwidth of a device, getting pretty close to the theoretical limit, by having each thread of a warp read a single sequential 32-bit word from device memory, with each warp's read aligned to a nice boundary of an even 32 words.
But why, then, is there so much code that carefully reads groups of float4 or int4 per thread to get max bandwidth, instead of one word per thread?
I think I understand the answer but the programming guide is silent, and I got most of the hints from other posts here on the forum, so I want to make sure I understand.
My impression, and please correct me if I'm wrong, is that there may actually be NO bandwidth improvement in reading float4 vs. single floats. The reason float4 reads are superior (when applicable) is that they use fewer instructions, since each thread queues up 4 words with a single load. That may not matter if the kernel is just doing a memory copy, but it IS a savings if it's doing other compute, since it's basically 3 load instructions saved for free.
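To make the comparison concrete, here is a minimal sketch of the two access patterns I mean. The kernel names `copy_float` and `copy_float4` and the host helper `copies_match` are made up for illustration, and the block size of 256 is an arbitrary choice. It assumes the element count is a multiple of 4 and that the buffers come from `cudaMalloc`, which returns pointers aligned well enough for `float4` access. It only checks that both versions copy the same data; it makes no claim about which one is faster.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// One 32-bit word per thread: a warp touches 32 consecutive floats (128 bytes).
__global__ void copy_float(const float* in, float* out, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Four words per thread: each thread moves one 16-byte float4, so the same
// amount of data needs a quarter as many threads (and load/store instructions).
__global__ void copy_float4(const float4* in, float4* out, size_t n4)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) out[i] = in[i];
}

// Hypothetical helper: runs both kernels on the same input and returns true
// if both outputs equal the input. Assumes n > 0 and n is a multiple of 4.
bool copies_match(size_t n)
{
    assert(n > 0 && n % 4 == 0);
    const size_t bytes = n * sizeof(float);

    std::vector<float> h_in(n), h_a(n), h_b(n);
    for (size_t i = 0; i < n; ++i) h_in[i] = (float)(i % 1024);

    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess ||
        cudaMalloc(&d_a, bytes) != cudaSuccess ||
        cudaMalloc(&d_b, bytes) != cudaSuccess) {
        cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);  // freeing nullptr is a no-op
        return false;
    }
    bool ok = cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    const unsigned threads = 256;  // arbitrary block size for this sketch
    const size_t n4 = n / 4;       // number of float4 elements

    copy_float<<<(unsigned)((n + threads - 1) / threads), threads>>>(d_in, d_a, n);
    copy_float4<<<(unsigned)((n4 + threads - 1) / threads), threads>>>(
        reinterpret_cast<const float4*>(d_in), reinterpret_cast<float4*>(d_b), n4);

    ok = ok && cudaGetLastError() == cudaSuccess;       // launch errors
    ok = ok && cudaDeviceSynchronize() == cudaSuccess;  // execution errors
    ok = ok && cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    ok = ok && cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);
    return ok && h_a == h_in && h_b == h_in;
}
```

Calling something like `copies_match(1 << 20)` from `main` and building with `nvcc` should be enough to try it. To test the actual question, one would time each kernel (for example with CUDA events) and compare the generated instructions with `cuobjdump -sass`.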
Is my understanding correct? Or are there other advantages?