I’m a little confused about how latency hiding works, and I would appreciate it if someone could enlighten me.
Suppose I write my kernel so that a given warp performs the following operations:

1. Load some values V_G from global to shared memory (high latency).
2. Do some work on other variables that are already present in shared memory (whose execution time is potentially shorter than the latency of step 1).
3. Use the values V_G loaded in step 1 for further processing.
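For concreteness, here is a minimal sketch of the pattern I have in mind (the names `V_G`, `other`, and the block size of 256 are hypothetical, and the arithmetic is just a placeholder):

```cuda
__global__ void kernel(const float* __restrict__ global_in, float* out)
{
    __shared__ float V_G[256];    // destination of the global-memory load
    __shared__ float other[256];  // assume this was filled earlier in the kernel

    int tid = threadIdx.x;

    // Step 1: load from global to shared memory (high latency).
    V_G[tid] = global_in[tid];

    // Step 2: work that does NOT depend on V_G.
    float tmp = other[tid] * 2.0f;

    __syncthreads();  // make V_G visible to all threads in the block

    // Step 3: further processing that DOES use V_G.
    out[tid] = tmp + V_G[tid];
}
```

The instructions in step 2 read only `other` and `tmp`, so nothing there depends on the load issued in step 1.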
The question is: does my warp execute the operations in step 2 before the V_G values from global memory reach shared memory (it could, since step 2 does not depend on V_G), or does the warp wait for V_G to be loaded into shared memory before doing the work in step 2?
The CUDA programming guide says the following:
“instructions are pipelined, but unlike CPU cores they are executed in order and there is no branch prediction and no speculative execution”
The above fragment seems to suggest that the GPU does indeed execute step 2 while waiting for the load in step 1 to finish – but my memory of what pipelining actually implies is fuzzy enough that I need somebody to confirm (or deny) it.
(I know that the GPU can hide latency by scheduling other warps on the SM – warps that are ready for execution – but I am interested in latency hiding within a single warp.)