The core question here is what exactly happens on a global memory fetch:
does the thread/warp stall and leave room for others until the value arrives (so that eventually all threads wait for their values to arrive), or does it continue executing until it actually uses the value?
i.e., consider the following code:
int x = GlobalArray[i];
y += z;
l *= u;
…
…
…
…
…
n += x;
OR
int x = GlobalArray[i];
n += x;
…
…
…
…
…
Do they have the same performance, or is the first one faster?
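For what it's worth, here is a minimal sketch of how the two orderings could be compared as actual kernels (the kernel names, the out array, and the filler arithmetic are my own invention, not part of the original example):

__global__ void loadEarly(const int *GlobalArray, int *out, int y, int z, int l, int u)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int n = 0;
    int x = GlobalArray[i];   // load issued first
    y += z;                   // independent arithmetic that could overlap
    l *= u;                   // with the memory latency
    n += x;                   // first actual use of x
    out[i] = n + y + l;       // keep all results live
}

__global__ void loadLate(const int *GlobalArray, int *out, int y, int z, int l, int u)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int n = 0;
    int x = GlobalArray[i];
    n += x;                   // x is used immediately after the load
    y += z;
    l *= u;
    out[i] = n + y + l;
}

Note that nvcc is free to reschedule instructions, so you would have to inspect the generated SASS (e.g. with cuobjdump -sass) to confirm the source-level ordering actually survives, and time the two launches (e.g. with cudaEvent timers) to get the comparison.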
And I hope you can also point out where you got the information from … because I don't think I've seen it anywhere in the Programming Guide or the Best Practices document.
Well, I've seen that topic, but it only talks about pipelining the memory access, which isn't what I meant. Regardless of whether the compiler optimizes anything, I want to know how the actual mechanism works in hardware.
Does the thread stall until the value arrives, or does it go on doing other work? Refer to the example I wrote above.
As I wrote, both are going to happen. It is officially documented that other warps are scheduled while one warp is waiting for values from memory. And according to the measurements in the thread cited above, the warp also continues to run until the value from memory is actually used.
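For anyone who wants to reproduce that kind of measurement, here is a rough sketch of a clock()-based probe (my own illustration, not the code from the cited thread; all names are made up). If the warp only stalls when the loaded value is first used, t1 - t0 should be small and most of the latency should show up in t2 - t1:

__global__ void probeStall(const int *GlobalArray, int *out, long long *times)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    long long t0 = clock64();
    int x = GlobalArray[i];    // issue the load
    long long t1 = clock64();  // cheap if the load itself does not stall the warp
    int n = x + 1;             // first use of x: the dependency stall should land here
    long long t2 = clock64();
    out[i] = n;                // keep x live
    if (i == 0) {
        times[0] = t1 - t0;    // time charged to the load instruction
        times[1] = t2 - t1;    // time charged to the first use
    }
}

The usual caveat applies: the compiler may still move the clock64() calls or the load around, so the SASS has to be checked before trusting the numbers.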
So it is officially documented; that is good to know. As to the OP's question, it depends on what is in the … . If it is just a series of non-memory instructions, then the two versions should perform equally, as per tera's comment. If it contains memory operations that are independent but that the compiler cannot prove to be independent, then the first version could be faster…
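To illustrate the "provably independent" point with an invented example (the names and the __restrict__ variant are mine): in the first kernel below the compiler has to assume that the store to out[i] might alias b, so it cannot hoist the load of b[i] above that store; in the second, the __restrict__ qualifiers tell it the pointers do not alias, so it may reorder the load and overlap the two memory latencies.

__global__ void sumMayAlias(const int *a, const int *b, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = a[i];    // store that might alias b[i]
    out[i] += b[i];   // this load cannot be moved above the store
}

__global__ void sumNoAlias(const int * __restrict__ a,
                           const int * __restrict__ b,
                           int * __restrict__ out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = a[i];    // with __restrict__ the compiler may reorder the
    out[i] += b[i];   // load of b[i] ahead of the store, overlapping latencies
}

Hand-hoisting the loads into local variables before any store, as in the OP's first ordering, achieves the same effect without relying on __restrict__.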