Hello,
Suppose two warps each issue a 128-byte global memory request, one directly after the other, on a compute capability 2.0 GPU. Are the two requests serviced in parallel (so the total time is roughly 400 to 800 clock cycles), or sequentially (so the total time is roughly 800 to 1600 clock cycles)? In other words, can global memory requests from different warps overlap?
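One way to check this empirically would be a small timing kernel. The following is only a minimal sketch, not a validated benchmark. The names `timedLoad`, `run`, and `sink` are made up for illustration. It assumes that each thread loading one 4-byte float makes a warp issue a single coalesced 128-byte request, and that the `volatile` qualifiers are enough to keep the compiler from moving the load across the `clock64()` reads (inspecting the generated machine code would be needed to confirm that).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread loads one float, so a full warp requests 32 * 4 = 128 bytes.
// Lane 0 of every warp records how many cycles its own load took.
__global__ void timedLoad(const volatile float *in, long long *cycles)
{
    __shared__ volatile float sink[64];   // hypothetical sink that consumes the loaded value
    int tid = threadIdx.x;

    long long start = clock64();
    float v = in[tid];                    // the global memory access being timed
    sink[tid] = v;                        // needs v, so the warp stalls until the load returns
    long long stop = clock64();

    if (tid % 32 == 0)
        cycles[tid / 32] = stop - start;
}

// Launch one block with `threads` threads (32 or 64) and print per-warp cycles.
static void run(const float *dIn, int threads)
{
    long long hCycles[2] = {0, 0};
    long long *dCycles = NULL;
    cudaMalloc(&dCycles, sizeof(hCycles));

    timedLoad<<<1, threads>>>(dIn, dCycles);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(hCycles, dCycles, sizeof(hCycles), cudaMemcpyDeviceToHost);
    cudaFree(dCycles);

    for (int w = 0; w < threads / 32; ++w)
        printf("%d threads, warp %d: %lld cycles\n", threads, w, hCycles[w]);
}

int main()
{
    // Three 128-byte segments: one for the single-warp run and two for the
    // two-warp run, so the second launch does not reread lines cached by the first.
    float *dIn = NULL;
    cudaMalloc(&dIn, 96 * sizeof(float));
    cudaMemset(dIn, 0, 96 * sizeof(float));

    run(dIn, 32);        // one warp: baseline latency
    run(dIn + 32, 64);   // two warps issuing back to back

    cudaFree(dIn);
    return 0;
}
```

If both warps in the 64-thread run report roughly the same cycle count as the single warp in the 32-thread run, that would suggest the requests overlap; if the second warp reports about twice as many cycles, that would suggest they are serialized. The first launch may also include one-time costs such as instruction cache misses, so repeating the runs and comparing would be sensible.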
Thanks,
Nikolaus