Global memory access: synchronous or asynchronous read/write?

The architectural term for this is “scoreboarding”. The processors on our GPUs are fully scoreboarded: a load does not block the thread that issued it, and the thread stalls only when a later instruction needs a register whose load is still outstanding. The compiler also tries to schedule available non-dependent instructions after high-latency instructions (such as global loads) in order to better hide latency. Note that loads can be used to hide the latency of other loads. For example, if you have 4 loads per thread, it is better to do this:

load( a )
load( b )
load( c )
load( d )
math( a )
math( b )
math( c )
math( d )

than this:

load( a )
math( a )
load( b )
math( b )
load( c )
math( c )
load( d )
math( d )
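To make the two orderings concrete, here is a minimal sketch of what they might look like in source code. It is written as plain C++ so it can be compiled anywhere; in a real kernel the same body would sit inside the kernel function, with the index derived from the thread index. The names math, process_batched, process_interleaved and the stride parameter are hypothetical, chosen only for illustration. Both functions return the same value, and an optimizing compiler may well reorder either form, so this shows the intent rather than a guaranteed instruction schedule.

```cpp
#include <cstddef>

// Hypothetical stand-in for the per-element math in the post.
inline float math(float x) { return x * 2.0f + 1.0f; }

// First ordering: issue all four independent loads, then do the math.
// None of the loads depends on another, so they can all be in flight
// at the same time.
inline float process_batched(const float* in, std::size_t i, std::size_t stride) {
    const float a = in[i];
    const float b = in[i + stride];
    const float c = in[i + 2 * stride];
    const float d = in[i + 3 * stride];
    return math(a) + math(b) + math(c) + math(d);
}

// Second ordering: each load is immediately followed by the math that
// consumes it, so the math waits on that one load before the next load
// is written in the source.
inline float process_interleaved(const float* in, std::size_t i, std::size_t stride) {
    float acc = math(in[i]);
    acc += math(in[i + stride]);
    acc += math(in[i + 2 * stride]);
    acc += math(in[i + 3 * stride]);
    return acc;
}
```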

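Let’s say each instruction takes 4 cycles per warp to issue, the load latency is 400 cycles, and you have 25 warps. Each instruction then takes 100 cycles to issue across all 25 warps. In the first case, the four loads take 400 cycles to issue, so the first warp's data is back just as its math begins, and math(d) finishes after 800 cycles with zero stalls. In the second case, taking the simple view that each load/math pair costs 100 cycles of load issue, 300 cycles of stall, and 100 cycles of math, math(d) finishes after 2000 cycles, with 60% of those cycles (1200) spent stalled waiting on loads.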
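The arithmetic behind those figures can be sketched as a small model. This is only a simplified illustration under stated assumptions, not a description of the real scheduler: one issue slot, every instruction costs the same number of issue cycles per warp, a load's data is available one load latency after issue begins, and in the second case the next load is not issued until the current math has been issued for all warps. The struct and function names are hypothetical.

```cpp
// Result of the simplified timing model: total cycles until the last
// math instruction has issued, and how many of those were stalls.
struct Timing {
    int total_cycles;
    int stall_cycles;
};

// First case: all loads are issued back to back, then all the math.
inline Timing model_batched(int n_loads, int issue_cycles, int load_latency, int warps) {
    const int per_instr = issue_cycles * warps;   // cycles to issue one instruction for every warp
    const int load_phase = n_loads * per_instr;   // cycles spent issuing loads
    // Stall only if the first load has not returned by the time the loads are all issued.
    const int stall = load_latency > load_phase ? load_latency - load_phase : 0;
    const int math_phase = n_loads * per_instr;   // one math instruction per load
    return {load_phase + stall + math_phase, stall};
}

// Second case: each load is followed directly by its dependent math.
inline Timing model_interleaved(int n_loads, int issue_cycles, int load_latency, int warps) {
    const int per_instr = issue_cycles * warps;
    // After issuing one load for every warp, wait out the rest of the latency.
    const int stall = load_latency > per_instr ? load_latency - per_instr : 0;
    const int per_pair = per_instr + stall + per_instr;  // load issue + stall + math issue
    return {n_loads * per_pair, n_loads * stall};
}
```

With the numbers from the post, model_batched(4, 4, 400, 25) gives 800 total cycles and 0 stalled, while model_interleaved(4, 4, 400, 25) gives 2000 total cycles and 1200 stalled, which is the 60% quoted above.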

The compiler tries to do the first case. :)

Mark