Global memory prefetch: is there any way?

asm · March 25, 2009, 2:39pm

Hi,

I have a kernel whose arithmetic complexity is considerably high.
I wonder if there is any way to overlap memory access with arithmetic operations, in other words,
to initiate the memory access while performing other arithmetic operations.

for instance, if one uses the following:

__global__ void kernel(unsigned *g_mem) {

    // thid is this thread's index (its computation is not shown here)
    unsigned x = g_mem[thid]; // "prefetch" x

    // ... do a lot of arithmetic that does not depend on x ...

    // ** x is used here **

}

Provided that I somehow manage to tell the compiler to fetch ‘x’ at the very beginning of the kernel and not immediately before it is used,
would ‘x = g_mem[thid];’ still cost me hundreds of clock cycles, or can the memory access be (partially) overlapped with the arithmetic operations that follow?

thanks

StickGuy · March 25, 2009, 3:06pm

According to the programming guide, if you’re able to have 192 active threads per multiprocessor, then you completely hide memory access latency.

E.D_Riedijk · March 25, 2009, 7:30pm

I believe the figure of 192 threads is for hiding register read-after-write stalls, that is, for hiding the pipeline depth rather than global memory latency.

MisterAnderson42 · March 25, 2009, 8:16pm

Prefetching, exactly as you describe it with your example code, is done automatically by the hardware, even within a single warp.

StickGuy · March 26, 2009, 9:03am

Ah, of course. My mistake.

asm · March 26, 2009, 10:41am

OK, that’s great :) ... because I have a solid piece of arithmetic to insert right after the access to global memory, and according to decuda the compiler does not seem to relocate memory load instructions even though the use of the loaded data is “deferred”.

Jamie_K · March 26, 2009, 6:49pm

This may seem pedantic, but do you really mean by the hardware, or by the compiler? I know the hardware can execute lots of stuff in other threads after the load instruction is issued and before the data is ready, but what you describe sounds like out-of-order execution (meaning instructions issued out of order). Can you clarify?

MisterAnderson42 · March 26, 2009, 8:51pm

I’m just repeating what has been stated on the forums by NVIDIA reps. It has been mentioned many times that nvcc tries to push load instructions as far up as it can in order to leverage this prefetching mechanism. I don’t think this is mentioned anywhere in the programming guide, but I haven’t read the new 2.2 one from top to bottom yet.

And note that prefetching in this manner most certainly does not need out-of-order execution. Issue and execution are still in order. The global memory load instruction is issued and then execution continues. Only when the register being loaded is to be read is that warp put into a “sleep” state until the data from the global memory load arrives. Even in the absence of prefetching, the hardware needs to put the warp into this “sleep” state until the result arrives, so the additional complication is minimal.
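To make the pattern concrete, here is a minimal sketch of it. It is only an illustration under stated assumptions: the kernel name early_load, the output array out, the bounds check, and the filler arithmetic loop are all hypothetical and not from the original post. The load is written first, the independent arithmetic follows, and x is read only at the end, which per the description above is the point where the warp would have to wait if the data has not arrived yet.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: names and the arithmetic are illustrative only.
__global__ void early_load(const unsigned *g_mem, unsigned *out, int n)
{
    int thid = blockIdx.x * blockDim.x + threadIdx.x;
    if (thid >= n) return;

    // The load appears first in the source. Execution continues past it;
    // nothing below reads x until the final statement.
    unsigned x = g_mem[thid];

    // Filler arithmetic that does not depend on x (assumed workload).
    unsigned acc = static_cast<unsigned>(thid);
    for (int i = 0; i < 256; ++i)
        acc = acc * 1664525u + 1013904223u;

    // First read of x: the warp waits here only if the data is still in flight.
    out[thid] = acc + x;
}

int main()
{
    const int n = 1 << 20;
    unsigned *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(unsigned));
    cudaMalloc(&d_out, n * sizeof(unsigned));
    cudaMemset(d_in, 0, n * sizeof(unsigned));

    early_load<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaError_t err = cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(err));

    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

Source order is only a hint: the compiler is free to schedule the load elsewhere, so it is worth checking the generated machine code (for example with cuobjdump -sass on a current toolkit) to see where the load actually ends up.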

Jamie_K · March 26, 2009, 9:54pm

Ok, thank you, that’s what I thought.

Because the hardware can continue executing past the load until the value is actually needed, the compiler can (and does) reorder the instructions so that the load is issued earlier.
