Do Consecutive ld.global Instructions Have an Order?

If CUDA PTX code contains two consecutive ld.global instructions, what happens for a single thread? Does the second ld instruction have to wait for the first ld instruction to complete before it starts reading from global memory?

No, generally speaking these instructions can be issued one after another.

On a GPU, a common reason an instruction cannot or will not issue is a dependency on a previous instruction. If the ld.global instruction has no such dependency (such as an address calculation based on previously computed data), then it should be able to issue as soon as the warp scheduler chooses.
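As a minimal sketch (the kernel and names are hypothetical), here are two loads whose addresses are both derived from the thread index, so neither depends on the other's result; the corresponding ld.global instructions can issue back-to-back:

```
__global__ void independent_loads(const float *a, const float *b, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = a[i];   // first ld.global
    float y = b[i];   // second ld.global: no dependency on x, can issue right behind it
    out[i] = x + y;   // the add is the instruction that actually waits on both loads
}
```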

The GPU is not an out-of-order machine (currently, anyway), so an instruction that cannot be issued for a particular warp, for whatever reason, prevents any subsequent instructions from being issued for that warp. That is called a stalled warp.

Rather, it is the subsequent instructions that use the result of the load which would be waiting.
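For contrast, here is a hypothetical sketch where the second load does have such a dependency: its address comes from the first load's result, so it cannot issue until that result arrives:

```
__global__ void dependent_loads(const int *indices, const float *data, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = indices[i];  // first ld.global
    out[i] = data[idx];    // second ld.global: its address depends on idx, so it must wait
}
```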

Sorry, I didn’t express my meaning clearly. What I want to ask is: after the two instructions are issued sequentially, will the global memory data they read arrive in the order of the ld instructions (assuming the two ld instructions cannot be coalesced)?

I’m not aware of anywhere that such a guarantee is made. It should not matter, at least for correctness, since the GPU has some mechanism to identify when dependencies are satisfied.

When I am teaching CUDA, I sometimes show (for understanding) a visual model of how instructions get processed in an SM. I will often state that the memory subsystem is a pipeline, just like most other functionality on a GPU, and when we issue work to a pipeline, it’s generally reasonable to think that if I issue work item A before B to that pipeline, then the results from A should “appear” before the results from B.

The GPU has both fixed-latency pipes and variable-latency pipes. Since the (global) memory pipe is a variable-latency pipe, I wouldn’t personally assume that there is guaranteed ordering on delivery of results. It might be, for example (just making this up), that request A, issued first, goes to partition X, which is heavily loaded/busy, while request B goes to partition Y, which is lightly loaded; in that case the results from A could “appear” after the results from B, even though B was issued after A.

I don’t think there are guarantees with variable latency pipes, about order of results delivery.

My expectation with a fixed-latency pipe is that item A, issued into the pipe before B, will have its results appear before the results from B. But I don’t know that this is stated or guaranteed anywhere, either.

The GPU does not “coalesce” or combine the memory activity from separately issued instructions, or from separately issued warps. Coalescing is done by the memory controller in the context/scope of a single instruction, issued to a single warp.
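To illustrate (a hypothetical kernel), coalescing applies within one instruction issued to one warp: when adjacent threads of a warp read adjacent addresses, the single warp-wide load can be serviced with a minimal number of memory transactions.

```
__global__ void coalesced_read(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Adjacent threads in a warp read adjacent floats, so this single
    // ld.global instruction coalesces into a few cache-line transactions.
    // Loads from other instructions or other warps are not merged with it.
    out[i] = in[i];
}
```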

“Arrive” is a relative term. If you want to make the loaded data visible somewhere else (e.g., to another thread or another block), you have to use fences and memory synchronization functions.
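As a hedged sketch of that last point (the names producer, consumer, result, and ready are all illustrative), this follows the writer-side fence pattern described in the CUDA Programming Guide: fence before publishing a flag, so the data is visible device-wide before the flag is.

```
__device__ volatile float result;  // volatile: don't cache the value in a register
__device__ int ready = 0;

__global__ void producer()   // e.g. launched as producer<<<1,1>>>
{
    result = 42.0f;          // write the data
    __threadfence();         // make the write visible device-wide...
    atomicExch(&ready, 1);   // ...before publishing the flag
}

__global__ void consumer(float *out)  // e.g. launched as consumer<<<1,1>>> afterwards
{
    while (atomicAdd(&ready, 0) == 0) { }  // wait until the flag is set
    *out = result;                         // the data write is visible by now
}
```

Without the __threadfence(), the flag could become visible to the consumer before the data it guards.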
