Latency Hiding Question

Hello,

I’m a little confused about how latency hiding works, and I would appreciate it if someone could enlighten me.

Suppose I write my kernel so that a given warp performs the following operations:

  1. Load some values V_G from Global to Shared Memory (high latency).

  2. Do some work with other variables that are already present in shared memory (potentially taking less time than the latency of step 1).

  3. Now use the values V_G loaded at step 1 for further processing.

The question is: does my warp execute the operations in step 2 before the V_G values from global memory reach shared memory (which would be legal, since step 2 does not depend on the V_G values), or does it wait for V_G to arrive in shared memory before starting the work of step 2?
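For concreteness, the three steps can be sketched like this (the kernel name, array sizes, and the s_other data are hypothetical; assume s_other was filled by earlier code):

```cuda
__global__ void kernel(const float *g_in, float *g_out)
{
    __shared__ float s_vg[256];     // destination of the global load (step 1)
    __shared__ float s_other[256];  // assumed already populated earlier

    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    // Step 1: high-latency load from global memory into shared memory.
    s_vg[t] = g_in[i];

    // Step 2: independent work on data already in shared memory;
    // it does not touch s_vg, so in principle it could overlap step 1.
    float r = s_other[t] * 2.0f + 1.0f;

    __syncthreads();  // make s_vg visible to all threads in the block

    // Step 3: further processing that depends on the loaded V_G values.
    g_out[i] = r + s_vg[t];
}
```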

The CUDA programming guide says the following:
“instructions are pipelined, but unlike CPU cores they are executed in order and there is no branch prediction and no speculative execution”

The above fragment seems to suggest that the GPU does indeed execute step 2 while waiting for step 1's load to finish – but my memory of the details of what pipelining implies is fuzzy enough that I need somebody to confirm it (or deny it).

(I know that the GPU can hide latency by scheduling other warps on the SM, warps that are ready for execution – but I am interested in latency hiding within one warp.)

Doing step 2 while step 1 has not yet finished would require the scoreboarding logic to keep track of individual locations in shared memory. I have not investigated this systematically, but in the cases I have tested, timing indicated that this is not the case.

Loading to registers, doing the other shared-memory work while the load instruction has not yet retired, and then storing the registers to shared memory should work, though. I'm not sure about this, but on compute capability 2.x devices the compiler might even do this automatically.
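The register-staging variant described above can be sketched as follows (same hypothetical names as before; the key point is that the warp stalls only at the first use of the loaded register, not at the load instruction itself):

```cuda
__global__ void kernel(const float *g_in, float *g_out)
{
    __shared__ float s_vg[256];
    __shared__ float s_other[256];  // assumed already populated earlier

    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    // Step 1: load into a register. The scoreboard tracks the register,
    // so the warp is not blocked here while the load is in flight.
    float v = g_in[i];

    // Step 2: independent shared-memory work overlaps the in-flight load.
    float r = s_other[t] * 2.0f + 1.0f;

    // The store is the first use of v, so any stall happens here,
    // after step 2 has already executed.
    s_vg[t] = v;
    __syncthreads();

    // Step 3: processing that depends on V_G.
    g_out[i] = r + s_vg[t];
}
```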

Thank you, that is helpful.

So, loading from global memory to registers, doing the work of step 2, and then writing the registers holding the V_G data to shared memory would hide some latency within the warp.

I will try to test this strategy versus the strategy of loading directly to shared memory in step 1 and see what happens.
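A minimal way to time the two variants against each other (sketch; grid, block, d_in, and d_out are placeholders for whatever launch configuration and device buffers the real test uses):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
kernel<<<grid, block>>>(d_in, d_out);  // either variant of the kernel
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed kernel time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Averaging over many launches (and doing a warm-up launch first) gives more stable numbers than a single measurement.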

Thank you again.