Global Memory Fetches: How to arrange them in code for best performance

The core question here: what exactly happens on a global memory fetch?

Does the thread (or warp) stall and make room for others until the value arrives, so that all threads eventually wait for their values, or does it continue executing until it actually uses the value?

For example, consider the following two versions of the code:

int x = GlobalArray[i];

y += z;
l *= u;

n += x;


int x = GlobalArray[i];
n += x;

Do they have the same performance, or is the first one faster?

I'd also appreciate a pointer to where you got the information from, because I don't think I've seen it anywhere in the Programming Guide or the Best Practices documents.



Your examples will most likely run equally fast, as the compiler will optimize both to the same code.

Check out this recent thread:

Well, I've seen that topic, but it only talks about pipelining the memory accesses, which is not what I meant. And regardless of whether the compiler is going to optimize the code, I want to know how the actual mechanism works in hardware.

Do the threads stall until the value arrives, or do they go on doing other work? Refer to the example I wrote earlier.


As I wrote, both are going to happen. It is officially documented that other warps are scheduled while one warp is waiting for values from memory. And according to the measurements in the thread cited above, the warp also continues to run until the value from memory is actually used.

…icikevicius.pdf
Specifically, look at the slides for the Launch Configuration topic.
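
To make the two orderings from the original question concrete, here is a minimal CUDA sketch. The kernel names, the helper compare_fetch_orderings, and the constant values are made up for illustration, and the second kernel does the y and l updates after the use of x only so that both kernels compute the same result. It does not measure anything; it only shows where the load is issued and where its value is first needed, which is the point up to which the warp can keep running according to the replies above.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Version 1: independent arithmetic sits between the load and its first use.
__global__ void fetch_then_work(const int* global_array, int* out, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;

    int y = i, z = 3, l = 2, u = 5, n = 1;  // placeholder values (assumption)

    int x = global_array[i];  // the load is issued here
    y += z;                   // does not depend on x
    l *= u;                   // does not depend on x
    n += x;                   // first instruction that needs x

    out[i] = n + y + l;       // keep every result live
}

// Version 2: the loaded value is consumed immediately.
__global__ void fetch_then_use(const int* global_array, int* out, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;

    int y = i, z = 3, l = 2, u = 5, n = 1;  // same placeholder values

    int x = global_array[i];  // the load is issued here
    n += x;                   // needs x right away
    y += z;
    l *= u;

    out[i] = n + y + l;
}

// Hypothetical host helper: runs both kernels and reports whether they agree.
bool compare_fetch_orderings()
{
    const int count = 1 << 20;
    const size_t bytes = count * sizeof(int);
    std::vector<int> input(count), result_a(count), result_b(count);
    for (int i = 0; i < count; ++i) input[i] = i % 1000;

    int *dev_in = nullptr, *dev_a = nullptr, *dev_b = nullptr;
    bool ok = cudaMalloc(&dev_in, bytes) == cudaSuccess
           && cudaMalloc(&dev_a, bytes) == cudaSuccess
           && cudaMalloc(&dev_b, bytes) == cudaSuccess
           && cudaMemcpy(dev_in, input.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        const int threads = 256;
        const int blocks = (count + threads - 1) / threads;
        fetch_then_work<<<blocks, threads>>>(dev_in, dev_a, count);
        fetch_then_use<<<blocks, threads>>>(dev_in, dev_b, count);
        // cudaMemcpy on the default stream waits for both kernels to finish.
        ok = cudaMemcpy(result_a.data(), dev_a, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(result_b.data(), dev_b, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess
          && result_a == result_b;
    }

    cudaFree(dev_in);
    cudaFree(dev_a);
    cudaFree(dev_b);
    return ok;
}
```

Calling compare_fetch_orderings() from a main function should return true on a machine with a working CUDA device. Since the compiler is free to reorder the independent arithmetic, dumping the generated code (for example with cuobjdump -sass) is probably the only way to tell whether the two kernels really differ after optimization.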

So it is officially documented; that is good to know. As to the OP's question, it depends on what is in the … . If it is just a series of non-memory instructions, then both versions should perform equally, as per tera's comment. If it contains independent memory operations that are not provably independent, then the first version could be faster…
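
A sketch of what "independent but not provably independent" might look like, with made-up kernel and parameter names. In the first kernel a store sits between two loads; since the compiler generally has to assume the output pointer may alias the second input, it cannot move the second load above the store on its own. The second kernel hoists the load by hand. Declaring the pointers __restrict__ would be another way to give the compiler that freedom. Whether this changes performance in a given case is an assumption to be measured, not something this sketch demonstrates.

```cuda
#include <vector>
#include <cuda_runtime.h>

// As written: the store through out_a separates the two loads. As far as the
// compiler knows, out_a may alias in_b, so the second load stays below it.
__global__ void loads_split_by_store(const int* in_a, const int* in_b,
                                     int* out_a, int* out_b, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;

    int a = in_a[i];
    out_a[i] = a * 2;  // needs a; might alias in_b from the compiler's view
    int b = in_b[i];   // second load is issued only after the store
    out_b[i] = b + 1;
}

// Hoisted by hand: both loads are issued before either value is used.
__global__ void loads_hoisted(const int* in_a, const int* in_b,
                              int* out_a, int* out_b, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;

    int a = in_a[i];
    int b = in_b[i];   // independent load, in flight together with the first
    out_a[i] = a * 2;
    out_b[i] = b + 1;
}

// Hypothetical host helper: runs both kernels on non-overlapping buffers and
// reports whether they produce the same output.
bool compare_load_orderings()
{
    const int count = 1 << 20;
    const size_t bytes = count * sizeof(int);
    std::vector<int> host_in(2 * count), split(2 * count), hoisted(2 * count);
    for (int i = 0; i < 2 * count; ++i) host_in[i] = i % 1000;

    int* dev = nullptr;  // layout: in_a | in_b | out_a | out_b
    if (cudaMalloc(&dev, 4 * bytes) != cudaSuccess) return false;
    int* in_a = dev;
    int* in_b = dev + count;
    int* out_a = dev + 2 * count;
    int* out_b = dev + 3 * count;

    const int threads = 256;
    const int blocks = (count + threads - 1) / threads;
    bool ok = cudaMemcpy(in_a, host_in.data(), 2 * bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        loads_split_by_store<<<blocks, threads>>>(in_a, in_b, out_a, out_b, count);
        ok = cudaMemcpy(split.data(), out_a, 2 * bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    if (ok) {
        loads_hoisted<<<blocks, threads>>>(in_a, in_b, out_a, out_b, count);
        ok = cudaMemcpy(hoisted.data(), out_a, 2 * bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(dev);
    return ok && split == hoisted;
}
```

Here the buffers do not actually overlap, so both kernels give the same output and compare_load_orderings() should return true; the difference is only in how early the second load can be issued.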


Here is the line I was looking for.

I hope you add it to the Programming Guide or the Best Practices guide.