Global memory overhead

Mr_Maritini · February 9, 2008, 2:11pm

Section 5.1.1.4 of the Programming Guide says that the overhead associated with a global memory access is 400-600 clock cycles.
Later on it says that the bandwidth of 128-bit coalesced global memory accesses is significantly lower than that of 32-bit coalesced accesses. In 5.1.2.1 the guide also indicates that non-coalesced global memory access has an order of magnitude lower bandwidth than the 32-bit coalesced case.

I was wondering whether the 400-600 cycle figure above refers to the 32-bit coalesced case, which is the best-case scenario, or to the worst-case scenario of non-coalesced 128-bit global memory access.

Thank you for taking the time.

MisterAnderson42 · February 9, 2008, 3:41pm

Throughput (bandwidth) and latency are different beasts altogether. With enough threads in flight, latency is almost completely hidden so it doesn’t matter in the end.

Throughput does matter greatly for performance, though.
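To put rough numbers on the latency-hiding point, here is a small back-of-the-envelope sketch using Little's law (data in flight = latency × bandwidth). Only the 400-600 cycle latency comes from the guide as quoted above. The clock rate and peak bandwidth are assumed, illustrative values for a high-end card of this era, and the function names are made up for the example, so substitute your own figures.

```python
# Little's law sketch: how much data must be "in flight" to hide memory latency.
# ASSUMPTIONS (not from the Programming Guide): clock rate and peak bandwidth
# below are illustrative values only; replace them with your card's figures.

ASSUMED_CLOCK_HZ = 1.35e9            # assumed clock, cycles per second
ASSUMED_BANDWIDTH_BYTES_S = 86.4e9   # assumed peak memory bandwidth, bytes per second


def bytes_in_flight(latency_cycles, clock_hz, bandwidth_bytes_s):
    """Bytes that must be outstanding to keep the memory bus saturated."""
    latency_seconds = latency_cycles / clock_hz
    return latency_seconds * bandwidth_bytes_s


def outstanding_loads(in_flight_bytes, bytes_per_load):
    """Number of simultaneous loads of the given size that cover in_flight_bytes."""
    return in_flight_bytes / bytes_per_load


# Latency range quoted from the guide: 400-600 clock cycles.
for latency in (400, 600):
    need = bytes_in_flight(latency, ASSUMED_CLOCK_HZ, ASSUMED_BANDWIDTH_BYTES_S)
    loads = outstanding_loads(need, 4)  # one 32-bit (4-byte) load per thread
    print(f"latency {latency} cycles: ~{need:.0f} bytes in flight, "
          f"~{loads:.0f} threads with one 32-bit load each")
```

Under these assumptions, a few thousand threads with one outstanding load each are enough to cover the latency, which is why the achieved throughput rather than the 400-600 cycle figure ends up determining performance.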

I did complete testing of all the various ways to read/write 32-bit, 64-bit and 128-bit values: http://forums.nvidia.com/index.php?showtop...41&#entry290441

Martini · February 9, 2008, 5:58pm

Thank you for your post, MisterAnderson.

I’m new to the GPU game and am trying to learn more about this. Is it possible for you to post, as an attachment, the code that you used for what you called “the full gamut of tests” (your Dec. 6 post)?

Thank you, have a good day.

MisterAnderson42 · February 9, 2008, 7:24pm

It’s already posted. Just scroll down a little lower in that thread.
