It is well known, and specifically mentioned in the programming guide, that uncoalesced global memory access is much slower than coalesced global memory access. The guide states that for coalesced access 32-bit is the fastest, followed by 64-bit and then 128-bit, whereas for uncoalesced access 128-bit is the fastest, then 64-bit, then 32-bit.
The question I have is: by how much specifically?
If 32-bit coalesced global memory access should give something close to the theoretical maximum bandwidth on a device, what is the exact fraction of that you can expect with every other combination (64-bit coalesced, 128-bit coalesced, 32-bit uncoalesced, 64-bit uncoalesced, 128-bit uncoalesced)? So, coalesced 32-bit is 1/1 and everything else is some other fraction.
I’m aware this may depend on the actual device in question, but for an example let’s say on the G80 series 8800 GTX.
Intuitively, my guess is that for coalesced access CUDA issues (total read size / device global memory interface width) memory transactions, each exactly the interface width in size, but that each stream processor can only issue one 32-bit memory transaction per instruction. For 64-bit and 128-bit access this would mean the accesses, while coalesced, run at 1/2 and 1/4 respectively of the speed of coalesced 32-bit access, regardless of the bus width of the device.
For uncoalesced 32-bit access, CUDA would be issuing one memory transaction per thread, each occupying only 32 bits of the bus width, so on the 384-bit interface of an 8800 GTX this would be 1/12 the speed of coalesced 32-bit access. For uncoalesced 64-bit and 128-bit access, I’m guessing CUDA attempts to issue 2 transactions per 2 threads and 4 transactions per 4 threads respectively. In a 64-bit access, the first thread would access the first 32-bit word of its 64-bit memory operation while the second thread simultaneously accesses the second 32-bit word of its own, then vice versa in the next transaction, and in the same fashion for a 128-bit access. This would make the fractions for uncoalesced access on an 8800 GTX 1/12 for 32-bit, 1/6 for 64-bit, and 1/3 for 128-bit.
Obviously this is just speculation, but it seems plausible that this is how it is implemented in hardware, and it fits with CUDA’s description of bandwidth rates for the various types of global memory access. One interesting consequence is that under this model 128-bit uncoalesced access (1/3) would be faster than 128-bit coalesced access (1/4), but again, this is all speculation. Does anyone have any hard numbers here?