The Programming Guide in Section 5.1.1.4 says that the overhead associated with a global memory access is 400-600 clock cycles.
Later on it says that the bandwidth of 128-bit coalesced global memory accesses is significantly lower than that of 32-bit coalesced accesses. In 5.1.2.1 the guide also indicates that non-coalesced global memory access has bandwidth an order of magnitude lower than the 32-bit coalesced case.
I was wondering whether the 400-600 cycle figure above refers to the 32-bit coalesced case, which is the best-case scenario, or to the worst-case scenario of non-coalesced 128-bit global memory access.
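Throughput (bandwidth) and latency are different beasts altogether. Latency is how long a single access takes to come back; throughput is how much data moves per unit of time. With enough threads in flight, the latency is almost completely hidden, because the hardware runs other threads while one is waiting on memory, so it doesn’t matter much in the end.
Throughput does matter greatly for performance, though.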
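To put a rough number on "enough threads in flight", here is a back-of-the-envelope sketch using Little's law (data in flight = latency x throughput). Only the 400-600 cycle latency comes from the guide. The clock rate, the peak bandwidth, and the idea of one outstanding 32-bit load per thread are illustrative assumptions of mine, and the function names are made up for this example.

```python
def bytes_in_flight(latency_cycles, clock_hz, bandwidth_bytes_per_s):
    """Little's law: bytes that must be outstanding to keep the memory bus busy."""
    latency_seconds = latency_cycles / clock_hz
    return latency_seconds * bandwidth_bytes_per_s


def threads_needed(latency_cycles, clock_hz, bandwidth_bytes_per_s, bytes_per_thread):
    """Threads required, ASSUMING each thread has exactly one load outstanding."""
    in_flight = bytes_in_flight(latency_cycles, clock_hz, bandwidth_bytes_per_s)
    return in_flight / bytes_per_thread


# Illustrative assumptions, not measurements:
CLOCK_HZ = 1.35e9       # assumed clock that the cycle counts refer to
BANDWIDTH = 86.4e9      # assumed peak global memory bandwidth, bytes/s
BYTES_PER_THREAD = 4    # one 32-bit load per thread

if __name__ == "__main__":
    for latency in (400, 600):  # range quoted from the Programming Guide
        n = threads_needed(latency, CLOCK_HZ, BANDWIDTH, BYTES_PER_THREAD)
        print(f"{latency} cycles -> about {round(n)} threads in flight")
```

Under those assumptions you need several thousand threads with a load outstanding across the whole chip before the latency stops showing up, which is why kernels are launched with so many threads. Once you are past that point, the runtime is set by throughput, and that is where coalescing comes in.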
I’m new at the GPU game and am trying to learn more about this. Is it possible for you to post, as an attachment, the code that you used for what you called “the full gamut of tests” (your Dec. 6 post)?