Global memory access is often the bottleneck of a CUDA implementation, so latency hiding is used: while thread 1 waits on a global memory access, other threads become active and perform arithmetic computations. If there is a lot of arithmetic to do per global memory access, the arithmetic dominates and the memory access time can be ignored; otherwise, the memory access time dominates and the arithmetic time can be ignored. In either case, the speedup from latency hiding is at most a factor of 2 compared to no latency hiding… at least that’s my theory. Could this be correct? Or does someone have experience with greater speedups?
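To make the factor-of-2 claim concrete, here is the arithmetic behind it as a tiny sketch (my own notation, not CUDA code): without latency hiding a kernel takes roughly compute time plus memory time, while with perfect overlap it takes the maximum of the two, and the ratio of those peaks at 2 when both times are equal.

```python
def speedup(compute_time, mem_time):
    """Idealized speedup of perfect latency hiding over no hiding.

    no hiding:      compute and memory phases happen back to back
    perfect hiding: the shorter phase is fully overlapped by the longer one
    """
    no_hiding = compute_time + mem_time
    perfect_hiding = max(compute_time, mem_time)
    return no_hiding / perfect_hiding

print(speedup(1.0, 1.0))   # balanced kernel: the factor-of-2 ceiling
print(speedup(10.0, 1.0))  # compute-bound kernel: hiding barely matters
```

Since (C + M) / max(C, M) = 1 + min(C, M) / max(C, M) ≤ 2, this simple model agrees with the bound above — though it assumes the hardware can overlap the two phases perfectly, which real occupancy limits may prevent.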