Global memory overhead

The Programming Guide says that the overhead associated with a global memory access is 400-600 clock cycles.
Later on it says that the bandwidth of 128-bit coalesced global memory accesses is significantly lower than that of 32-bit coalesced accesses. The guide also indicates that non-coalesced global memory access has bandwidth one order of magnitude lower than the 32-bit coalesced case.

I was wondering whether the 400-600 cycle figure above refers to the 32-bit coalesced case, which is the best-case scenario, or to the worst-case scenario of non-coalesced 128-bit global memory access.

Thank you for taking the time.

Throughput (bandwidth) and latency are different beasts altogether. With enough threads in flight, latency is almost completely hidden so it doesn’t matter in the end.

Throughput does matter greatly for performance, though.
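
To make the "enough threads in flight" point concrete, here is a rough back-of-the-envelope sketch. It assumes a deliberately simple model: each warp issues some number of cycles of independent work between memory requests, then stalls for the full latency. The function name `warps_needed` and the 40-cycle work figure are hypothetical, chosen only for illustration; the 400-600 cycle range is the one quoted from the Programming Guide above.

```python
import math


def warps_needed(latency_cycles: int, busy_cycles_per_warp: int) -> int:
    """Estimate how many warps keep the issue unit busy during a memory stall.

    Simple model (an assumption, not a hardware specification): while one
    warp waits latency_cycles for its load, the other warps must together
    supply at least that many cycles of work, so (N - 1) * busy >= latency.
    """
    return math.ceil(latency_cycles / busy_cycles_per_warp) + 1


# Hypothetical workload: 40 cycles of independent work per memory request.
for latency in (400, 600):
    print(latency, "cycles ->", warps_needed(latency, 40), "warps")
# 400 cycles -> 11 warps
# 600 cycles -> 16 warps
```

Under this model, once that many warps are resident the latency no longer shows up in the run time, and the achieved bandwidth becomes the limiting factor.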

I did complete testing of all the various ways to read/write 32-bit, 64-bit and 128-bit values:…41&#entry290441
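
For anyone wanting to compare numbers from tests like these, one common way to report the result is effective bandwidth: bytes read plus bytes written, divided by elapsed time. The sketch below is only an illustration of that arithmetic, not the code from the linked thread; the function name `effective_bandwidth_gbs`, the element count, and the timing are all hypothetical.

```python
def effective_bandwidth_gbs(n_elements: int, bytes_per_element: int,
                            seconds: float) -> float:
    """Effective bandwidth in GB/s for a copy that reads and writes each element once."""
    bytes_moved = 2 * n_elements * bytes_per_element  # one read + one write per element
    return bytes_moved / seconds / 1e9


# Hypothetical timing: the same 1 ms copy of 2**20 elements at each element width.
for width_bits in (32, 64, 128):
    gbs = effective_bandwidth_gbs(1 << 20, width_bits // 8, 1e-3)
    print(f"{width_bits}-bit elements: {gbs:.2f} GB/s")
```

In a real measurement the elapsed time would differ for each access width; plugging the measured times into the same formula is what makes the 32-bit, 64-bit and 128-bit cases comparable.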

Thank you for your post, MisterAnderson.

I’m new to the GPU game and am trying to learn more about this. Is it possible for you to post, as an attachment, the code that you used for what you called “the full gammut of tests” (your Dec. 6 post)?

Thank you, have a good day.

It’s already posted. Just scroll down a little lower in that thread.