Global memory prefetch: is there any way?

Hi,

I have a kernel whose arithmetic intensity is considerably high.
I wonder if there is any way to overlap memory access with arithmetic operations, in other words,
to initiate the memory access while performing other arithmetic operations.

For instance, if one uses the following:

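__global__ void kernel(unsigned *g_mem) {

    unsigned thid = threadIdx.x; // thread index (assumed; not defined in the original snippet)
    unsigned x = g_mem[thid];    // “prefetch” x

    // … do a lot of arithmetic that does not depend on ‘x’ …

    // ** x is used here **

}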

Provided that I somehow manage to tell the compiler to fetch ‘x’ at the very beginning of the kernel and not immediately before it gets used,
would ‘x = g_mem[thid];’ still cost me hundreds of clock cycles, or can the memory access be (partially) overlapped with the succeeding arithmetic operations?

thanks

According to the programming guide, if you’re able to have 192 active threads per multiprocessor, then you completely hide memory access latency.

I believe the figure of 192 threads is for hiding register read-after-write stalls, i.e. for hiding the pipeline depth.

Prefetching, exactly as you describe it with your example code, is done automatically by the hardware, even within a single warp.

Ah, of course. My mistake.

OK, that’s great :) I ask because I have a solid piece of arithmetic to insert right after the access to global memory, and according to decuda the compiler does not seem to relocate memory load instructions even though the use of the loaded data is “deferred”.

This may seem pedantic, but do you really mean by the hardware, or by the compiler? I know the hardware can execute lots of stuff in other threads after the load instruction is issued and before the data is ready, but what you describe sounds like out-of-order execution (meaning instructions issued out of order). Can you clarify?

I’m just repeating what has been stated on the forums by NVIDIA reps. It has been mentioned many times that nvcc tries to push load instructions as far up as it can in order to leverage this prefetching mechanism. I don’t think this is mentioned anywhere in the programming guide, but I haven’t read the new 2.2 one from top to bottom yet.

And note that prefetching in this manner most certainly does not require out-of-order execution. Issue and execution are still in order. The global memory load instruction is issued and then execution continues. Only when the loaded register is about to be read is that warp put into a “sleep” state until the data from the global memory load arrives. Even in the absence of prefetching, the hardware needs to put the warp into this “sleep” state until the result arrives, so the additional complication is minimal.

Ok, thank you, that’s what I thought.

Because the hardware can continue executing past the load until the value is actually needed, the compiler can (and does) reorder the instructions so that the load is issued earlier.
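
For anyone who wants to try this, here is a minimal sketch of the pattern discussed in this thread: the load is written first, the arithmetic that does not depend on it follows, and the loaded value is only read at the end. The names (prefetch_demo, run_prefetch_demo, g_in, g_out, iters) and the dummy arithmetic are my own assumptions for illustration, not anything from the programming guide. Whether the load really stays at the top of the generated code is up to nvcc, so it is worth checking the disassembly for your own kernel.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration only.
__global__ void prefetch_demo(const unsigned *g_in, unsigned *g_out, int n, int iters)
{
    int thid = blockIdx.x * blockDim.x + threadIdx.x;
    if (thid >= n) return;

    // The load is issued here. Execution continues in order; the warp
    // is not expected to wait at this point.
    unsigned x = g_in[thid];

    // Dummy arithmetic that does not depend on x (an assumption standing in
    // for the "solid piece of arithmetic" from the question).
    unsigned acc = (unsigned)thid;
    for (int i = 0; i < iters; ++i)
        acc = acc * 1664525u + 1013904223u;

    // First read of x: the warp sleeps here only if the data has not arrived.
    g_out[thid] = acc + x;
}

// Host reference for the same computation, used to check the result.
static unsigned reference(unsigned thid, unsigned x, int iters)
{
    unsigned acc = thid;
    for (int i = 0; i < iters; ++i)
        acc = acc * 1664525u + 1013904223u;
    return acc + x;
}

// Runs the kernel on n elements and returns true if every output matches.
bool run_prefetch_demo(int n, int iters)
{
    std::vector<unsigned> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = 3u * (unsigned)i + 1u;

    size_t bytes = (size_t)n * sizeof(unsigned);
    unsigned *d_in = 0, *d_out = 0;
    if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }

    bool ok = cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        prefetch_demo<<<blocks, threads>>>(d_in, d_out, n, iters);
        ok = cudaGetLastError() == cudaSuccess;
    }
    // The copy back also waits for the kernel to finish.
    ok = ok && cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    for (int i = 0; ok && i < n; ++i)
        ok = (h_out[i] == reference((unsigned)i, h_in[i], iters));

    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}

int main()
{
    bool ok = run_prefetch_demo(1 << 16, 256);
    printf("%s\n", ok ? "results match" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

The program only checks correctness. To see any overlap you would have to time this kernel against a variant where the loaded value is needed immediately, and confirm from the disassembly (decuda was mentioned above) that the load was not moved.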