The Programming Guide in Section 22.214.171.124 says that the overhead associated with a global memory access is 400-600 clock cycles.
Later on it says that the bandwidth of 128-bit coalesced global memory accesses is significantly lower than that of 32-bit coalesced accesses. In 126.96.36.199 the guide also indicates that non-coalesced global memory accesses have about an order of magnitude lower bandwidth than the 32-bit coalesced case.
I was wondering whether the 400-600 cycle figure above refers to the 32-bit coalesced case, which is the best-case scenario, or to the worst-case scenario of non-coalesced 128-bit global memory access.
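For reference, here is a minimal sketch of how the latency of a single global memory load could be measured directly, in case that helps frame the question. It is not taken from the guide. It assumes a device that supports `clock64()` (compute capability 2.0 or later), and the kernel name `chase`, the array size, the stride, and the iteration count are all hypothetical choices of mine. A single thread follows a chain of dependent 32-bit loads, so this only estimates per-load latency; it does not exercise coalescing or measure bandwidth, and caches or address translation may skew the number.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Pointer-chasing kernel: every load depends on the result of the previous
// one, so elapsed cycles / iters approximates the latency of one load.
__global__ void chase(const unsigned int *next, int iters,
                      long long *cycles, unsigned int *sink)
{
    unsigned int idx = 0;
    long long start = clock64();
    for (int i = 0; i < iters; ++i) {
        idx = next[idx];              // dependent 32-bit global load
    }
    long long stop = clock64();
    *cycles = stop - start;
    *sink = idx;                      // keeps the loop from being optimized away
}

int main()
{
    // Assumed sizes: 4M entries (16 MiB), stride of 1024 elements (4 KiB),
    // so the 4096 loads all touch distinct, widely spaced addresses.
    const int n = 1 << 22;
    const int stride = 1024;
    const int iters = 4096;

    unsigned int *h_next = new unsigned int[n];
    for (int i = 0; i < n; ++i) {
        h_next[i] = (i + stride) % n; // each entry points one stride ahead
    }

    unsigned int *d_next = nullptr, *d_sink = nullptr;
    long long *d_cycles = nullptr;
    cudaMalloc(&d_next, n * sizeof(unsigned int));
    cudaMalloc(&d_sink, sizeof(unsigned int));
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMemcpy(d_next, h_next, n * sizeof(unsigned int),
               cudaMemcpyHostToDevice);

    chase<<<1, 1>>>(d_next, iters, d_cycles, d_sink); // one thread only

    long long cycles = 0;
    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(&cycles, d_cycles, sizeof(cycles),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("approx. cycles per dependent load: %.1f\n",
           (double)cycles / iters);

    cudaFree(d_next);
    cudaFree(d_sink);
    cudaFree(d_cycles);
    delete[] h_next;
    return 0;
}
```

This should build with something like `nvcc latency.cu -o latency` (the file name is arbitrary). The measured value includes a small amount of loop overhead, so I would treat it as a rough estimate only.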
Thank you for taking the time.