Latency Hiding Question

Hello,

I’m a little confused about how latency hiding works, and I would appreciate it if someone could enlighten me.

Suppose I write my kernel so that a given warp performs the following operations:

  1. Load some values V_G from Global to Shared Memory (high latency).

  2. Do some work with other variables that are already present in shared memory (potentially taking less time than the latency of step 1).

  3. Now use the values V_G loaded at step 1 for further processing.

The question is: does my warp execute the operations in step 2 before the V_G values from global memory reach shared memory (which would be legal, since step 2 does not depend on the V_G values), or does it wait for V_G to arrive in shared memory before starting the work of step 2?
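For concreteness, the three steps can be sketched like this (the kernel name, array sizes, and the s_other data are hypothetical; assume s_other was filled by earlier code):

```cuda
__global__ void kernel(const float *g_in, float *g_out)
{
    __shared__ float s_vg[256];     // destination of the global load (step 1)
    __shared__ float s_other[256];  // assumed already populated earlier

    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    // Step 1: high-latency load from global memory into shared memory.
    s_vg[t] = g_in[i];

    // Step 2: independent work on data already in shared memory;
    // it does not touch s_vg, so in principle it could overlap step 1.
    float r = s_other[t] * 2.0f + 1.0f;

    __syncthreads();  // make s_vg visible to all threads in the block

    // Step 3: further processing that depends on the loaded V_G values.
    g_out[i] = r + s_vg[t];
}
```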

The CUDA programming guide says the following:
“instructions are pipelined, but unlike CPU cores they are executed in order and there is no branch prediction and no speculative execution”

The above fragment seems to suggest that the GPU does indeed execute step 2 while waiting for step 1's load to finish – but my memory of the details of what pipelining implies is fuzzy enough that I need somebody to confirm it (or deny it).

(I know that the GPU can hide latency by scheduling other warps on the SM, warps that are ready for execution – but I am interested in latency hiding within one warp.)

Doing step 2 while step 1 has not yet finished would require the scoreboarding logic to keep track of individual locations in shared memory. I have not investigated this systematically, but in the cases I have tested, timing indicated that this is not the case.

Loading to registers, doing the other shared-memory work while the load instruction has not yet retired, and then storing the registers to shared memory should work, though. I'm not sure about this, but on compute capability 2.x devices the compiler might even do this automatically.
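The register-staging variant described above can be sketched as follows (same hypothetical names as before; the key point is that the warp stalls only at the first use of the loaded register, not at the load instruction itself):

```cuda
__global__ void kernel(const float *g_in, float *g_out)
{
    __shared__ float s_vg[256];
    __shared__ float s_other[256];  // assumed already populated earlier

    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    // Step 1: load into a register. The scoreboard tracks the register,
    // so the warp is not blocked here while the load is in flight.
    float v = g_in[i];

    // Step 2: independent shared-memory work overlaps the in-flight load.
    float r = s_other[t] * 2.0f + 1.0f;

    // The store is the first use of v, so any stall happens here,
    // after step 2 has already executed.
    s_vg[t] = v;
    __syncthreads();

    // Step 3: processing that depends on V_G.
    g_out[i] = r + s_vg[t];
}
```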

Thank you, that is helpful.

So, loading from global memory to registers, doing the work of step 2, and then writing the registers holding the V_G data to shared memory would hide some latency within the warp.

I will try to test this strategy versus the strategy of loading directly to shared memory in step 1 and see what happens.
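A minimal way to time the two variants against each other (sketch; grid, block, d_in, and d_out are placeholders for whatever launch configuration and device buffers the real test uses):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
kernel<<<grid, block>>>(d_in, d_out);  // either variant of the kernel
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed kernel time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Averaging over many launches (and doing a warm-up launch first) gives more stable numbers than a single measurement.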

Thank you again.