Global memory access: synchronous or asynchronous read/write?

The changes to the coalescing conditions in the new compute hardware are interesting and seem clearly explained on page 52. However, one aspect of global memory access that I cannot find explained clearly anywhere is whether the access is synchronous or asynchronous with respect to device code execution.

By a synchronous model, I mean that a global read would cause the warp to block at that point in the device code until the read completes. This would require other active warps and/or active blocks to hide the 400-600 cycle latency.

Under an asynchronous model, the warp would start the global read and immediately continue executing the device code, only waiting when it reached a point where the read value is actually used. (This is like asynchronous message-passing in MPI.) In this model, the read latency could be hidden even for a single block with a single warp, provided there is enough computation to perform between the global read and the first use of the data.
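To picture what I mean, here is a minimal kernel sketch (the names and the amount of independent work are made up):

__global__ void async_sketch( const float* in, float* out )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float v = in[i];             // global read starts here

    // independent work that does not touch v
    float acc = 0.0f;
    for( int k = 0; k < 64; ++k )
        acc = acc * 1.0001f + k; // placeholder compute

    out[i] = v + acc;            // first use of v: under the asynchronous
                                 // model, the warp stalls here only if the
                                 // read has not yet completed
}

Under the asynchronous model, the loop would overlap with the read latency; under the synchronous model, the warp would stall at the read itself.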

The same question also arises for global writes.

Based on the comment about the thread scheduler in the final paragraph of 5.1.1.3, I suspect it may be the synchronous model, but I’d appreciate it if someone could clarify this.

Sorry if you think I should have posted this to the regular forums, but since you have clearly made significant changes to global memory access, this could easily be something that has changed in the new hardware – that’s why I have posted it here.

Thanks,

Mike

Mike,
it’s asynchronous according to your definition – an instruction blocks only when one of its operands isn’t ready. Writes are non-blocking (asynchronous).

Massimiliano

As far as I understand, it has been like this since at least CUDA 1.0.

The architectural term for this is “scoreboarding”. The processors on our GPUs are fully scoreboarded. Also, the compiler tries to schedule available non-dependent instructions after high-latency instructions (such as global loads) in order to better hide latency. Note that loads can be used to hide the latency of other loads. For example, if you have 4 loads per thread, it is better to do this:

load( a )
load( b )
load( c )
load( d )
math( a )
math( b )
math( c )
math( d )

than this:

load( a )
math( a )
load( b )
math( b )
load( c )
math( c )
load( d )
math( d )

Let’s say each instruction takes 4 cycles per warp to issue, and the load latency is 400 cycles, and you have 25 warps. Then the first case will finish math(d) after 800 cycles, with zero stalls. The second case will finish math(d) after 2000 cycles, with 60% of those cycles spent stalled waiting on loads.
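Spelling out those numbers (assuming round-robin issue across the 25 warps):

first case:  25 warps x 4 loads x 4 cycles = 400 cycles to issue all loads
             (the first load has returned by then), then
             25 warps x 4 maths x 4 cycles = 400 cycles of math
             => 800 cycles, zero stalls

second case: each math waits on the load just before it, so each round is
             100 cycles (25 loads) + 300 cycles stalled + 100 cycles (25 maths)
             => 4 rounds x 500 cycles = 2000 cycles, 1200 (60%) stalled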

The compiler tries to do the first case. :)
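In CUDA C, a hand-written version of the first pattern might look like this (a hypothetical kernel – in practice the compiler will usually perform this reordering for you):

__global__ void grouped_loads( const float* a, const float* b,
                               const float* c, const float* d,
                               float* out )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // issue all four independent loads up front...
    float ra = a[i];
    float rb = b[i];
    float rc = c[i];
    float rd = d[i];

    // ...then consume them: each load's latency overlaps with
    // the issue of the loads that follow it
    out[i] = ra * ra + rb * rb + rc * rc + rd * rd;
}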

Mark