Global Memory Fetches: How to arrange them in code for best performance

Ahmed_Tarek · June 2, 2010, 12:58am

The core question here is: what exactly happens on a global memory fetch?

Does the thread (or its warp) stall and make room for other warps until the value arrives, so that all threads eventually wait for their values, or does it continue executing until it actually uses the value?

For example, consider the following code:

int x = GlobalArray[i];

y += z;
l *= u;
…
…
…
…
…
n += x;

OR

int x = GlobalArray[i];
n += x;
…
…
…
…
…

Do they have the same performance, or is the first one faster?

I'd also appreciate a pointer to where you got the information, because I don't think I've seen it anywhere in the programming guide or the best practices documents.

Thanks.

tera · June 2, 2010, 7:50am

Both.

Your examples will most likely run equally fast, as the compiler will optimize both to the same code.

Check out this recent thread: http://forums.nvidia.com/index.php?showtopic=169246

Ahmed_Tarek · June 2, 2010, 9:59am

Well, I've seen that topic, but it only talks about pipelining the memory accesses, which isn't what I meant. And regardless of whether the compiler optimizes the code, I want to know how the actual mechanism works in hardware.

Do the threads stall until the value arrives, or do they go on doing other work? Refer to the example I wrote earlier.

Thanks

tera · June 2, 2010, 11:42am

As I wrote, both are going to happen. It is officially documented that other warps are scheduled while one warp is waiting for values from memory. And according to the measurements in the thread cited above, the warp itself also continues to run until the value from memory is actually used.

paulius · June 2, 2010, 7:08pm

http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_Optimization_Micikevicius.pdf
Specifically, look at the slides for the Launch Configuration topic.

Gregory_Diamos · June 2, 2010, 10:03pm

So it is officially documented; that is good to know. As to the OP's question, it depends on what is in the … . If it is just a series of non-memory instructions, then both versions should perform equally, as per tera's comment. If it contains independent memory operations that are not provably independent, then the first version could be faster…
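
One way to read the last point is sketched below. This is only an illustration under assumptions, not code from this thread: the kernel names, the arrays `a`, `b`, `out`, `out2`, and the helper `run_and_check` are all made up for the example. In the first kernel a store sits between two loads; since the compiler cannot prove that `out` and `b` do not overlap, it may have to keep the second load behind that store, and the store itself waits for the first load. In the second kernel both loads are written first, so both can be in flight before either value is used. Whether this makes a measurable difference depends on the hardware, the compiler version, and occupancy.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Version A: a store sits between the two loads. The pointers carry no
// aliasing information, so the compiler must assume out may overlap b and
// may not be able to move the load of b[i] above the store to out[i].
__global__ void loads_interleaved(const int* a, const int* b,
                                  int* out, int* out2, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int x = a[i];
        out[i] = x + 1;   // first use of x: the warp waits here for the load
        int y = b[i];     // written after the store in program order
        out2[i] = y * 2;
    }
}

// Version B: both loads are written before any use, so they can overlap.
// This matches version A only under the assumption that out does not
// alias b (true in the host code below).
__global__ void loads_hoisted(const int* a, const int* b,
                              int* out, int* out2, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int x = a[i];
        int y = b[i];     // issued before x is used
        out[i] = x + 1;
        out2[i] = y * 2;
    }
}

// Hypothetical helper: runs both kernels on the same input and checks that
// they produce the expected, identical results. It does not measure time.
bool run_and_check()
{
    const int n = 1 << 16;
    const size_t bytes = n * sizeof(int);
    std::vector<int> ha(n), hb(n), o1(n), o2(n), p1(n), p2(n);
    for (int i = 0; i < n; ++i) { ha[i] = i; hb[i] = 3 * i; }

    int *a = nullptr, *b = nullptr, *out = nullptr, *out2 = nullptr;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&out, bytes);
    cudaMalloc(&out2, bytes);
    cudaMemcpy(a, ha.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b, hb.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    loads_interleaved<<<blocks, threads>>>(a, b, out, out2, n);
    cudaMemcpy(o1.data(), out, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(o2.data(), out2, bytes, cudaMemcpyDeviceToHost);

    loads_hoisted<<<blocks, threads>>>(a, b, out, out2, n);
    cudaMemcpy(p1.data(), out, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(p2.data(), out2, bytes, cudaMemcpyDeviceToHost);

    bool ok = (cudaGetLastError() == cudaSuccess);
    for (int i = 0; i < n && ok; ++i) {
        ok = (o1[i] == ha[i] + 1) && (o2[i] == hb[i] * 2) &&
             (p1[i] == o1[i]) && (p2[i] == o2[i]);
    }

    cudaFree(a); cudaFree(b); cudaFree(out); cudaFree(out2);
    return ok;
}

int main()
{
    printf(run_and_check() ? "results match\n" : "mismatch or CUDA error\n");
    return 0;
}
```

Instead of hoisting the loads by hand, the pointers can be declared `const int* __restrict__` and `int* __restrict__`, which promises the compiler that they do not alias and leaves the reordering to it. Comparing the generated code or timing both kernels on the target GPU is the only way to know whether either change matters for a given case.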

Ahmed_Tarek · June 2, 2010, 11:09pm

Thanks

here is the line I was looking for

I hope you add it to the programming guide or the best practices guide.
