The changes to the coalescence conditions in the new compute hardware are interesting and seem clearly explained on page 52. However, one aspect of global memory access that I cannot find explained clearly anywhere is whether the access is synchronous or asynchronous with respect to device code execution.
By a synchronous model, I mean one in which a global read causes the warp to block at that point in the device code until the read completes. Hiding the 400-600 cycle latency would then require other active warps and/or active blocks to be available for scheduling.
Under an asynchronous model, the warp would issue the global read and then immediately continue executing device code until it reached the point at which the read value is first used, and only there would it have to wait for the read to complete. (This is like asynchronous message passing in MPI.) In this model, the read latency could be hidden even by a single block with a single warp, provided there is enough independent computation between the global read and the first use of the data.
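To make the asynchronous case concrete, here is a rough sketch of the kind of kernel I have in mind (the kernel name and the amount of filler arithmetic are just made up for illustration): independent register-only work sits between the load and the first use of the loaded value, so under the asynchronous model the warp would only stall, if at all, at the final line.

```cuda
// Hypothetical kernel illustrating the asynchronous-read question.
// If global loads are non-blocking until first use, the loop on `acc`
// below could overlap with the 400-600 cycle latency of the read into `v`.
__global__ void overlap_demo(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = in[i];              // global read issued here

    // Independent computation that never touches `v`:
    float acc = 0.0f;
    for (int k = 1; k <= 64; ++k)
        acc += (float)k * 0.5f;   // register-only arithmetic

    out[i] = v + acc;             // first use of `v`: under the asynchronous
                                  // model the warp would wait here, not at
                                  // the load above
}
```

Under the synchronous model, by contrast, the same warp would stall at the `in[i]` load itself and the intervening loop would buy nothing for a single-warp block.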
The same question also arises for global writes.
Based on the comment about the thread scheduler in the final paragraph of 5.1.1.3, I suspect it may be the synchronous model, but I'd appreciate it if someone could clarify this.
Sorry if you think I should have posted this to the regular forums, but this could easily be something that has changed in the new hardware, since you have clearly made significant changes to global memory access; that is why I have posted it here.
Thanks,
Mike