Is global memory access cached, at least a little? global memory access

yk_cadcg · September 14, 2007, 2:21pm

Some say global memory is transferred in units called “blocks”.
I understand that a coalesced access pattern can batch transfers into a single big one and thus reduce the overhead associated with each transfer, because cudaMemcpy is block-wise.
But for a non-coalesced access by an individual thread, does it fetch a whole block, or exactly the data type it requires, each time it accesses global memory?
Thanks!

wumpus · September 16, 2007, 9:51am

It fetches just what it requires; there is no caching for global memory.

yk_cadcg · September 16, 2007, 10:57am

Thanks. Could anyone from NVIDIA shed some light on the size of a physical block of global memory? If the basic transfer unit is a BYTE, I’m afraid the hardware addressing space/matrix would be too large.
The detailed info is not yet fully exposed, is it? Thanks.

wumpus · September 17, 2007, 10:30am

The CUDA programming guide (5.1.2.1) says “the device is capable of reading 32-bit, 64-bit, or 128-bit words from global memory into registers in a single instruction”.
These three can be coalesced over a half warp of 16 threads, so the accesses are combined into 512-bit, 1024-bit, or 2048-bit transfers, respectively.

This means that the memory controller can be asked to fetch blocks ranging from 4 to 256 bytes, provided they are aligned to their size.
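
To make the arithmetic above concrete, here is a small host-side C++ sketch (not device code). It only reproduces the sizes quoted in this post: 16 threads per half warp, each reading a 4-, 8-, or 16-byte word. The names `kHalfWarpThreads`, `coalesced_bytes`, and `is_aligned` are made up for illustration, and the alignment check is just the "aligned on this size" condition taken literally; it does not model what the memory controller really does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption taken from the post: coalescing happens over a half warp of 16 threads.
constexpr std::size_t kHalfWarpThreads = 16;

// Bytes covered when every thread of a half warp reads one word of `word_bytes` bytes.
constexpr std::size_t coalesced_bytes(std::size_t word_bytes) {
    return kHalfWarpThreads * word_bytes;
}

// True if `address` is a multiple of `block_bytes` (hypothetical helper).
constexpr bool is_aligned(std::uint64_t address, std::size_t block_bytes) {
    return address % block_bytes == 0;
}

// 32-bit words  -> 16 * 4  = 64 bytes   = 512 bits
// 64-bit words  -> 16 * 8  = 128 bytes  = 1024 bits
// 128-bit words -> 16 * 16 = 256 bytes  = 2048 bits
static_assert(coalesced_bytes(4) * 8 == 512, "32-bit words give 512 bits");
static_assert(coalesced_bytes(8) * 8 == 1024, "64-bit words give 1024 bits");
static_assert(coalesced_bytes(16) * 8 == 2048, "128-bit words give 2048 bits");
```

For example, under these assumptions a half warp reading 32-bit words starting at byte address 128 would satisfy `is_aligned(128, coalesced_bytes(4))`, while one starting at address 132 would not.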

I don’t know what you mean by “physical block”, though.

yk_cadcg · September 17, 2007, 10:59am

It’s only my guess. The memory-addressing hardware uses a CAS-RAS matrix. If the basic addressing unit is a BYTE, this matrix is fine-grained and too big. If the basic addressing unit is a bigger block, that implies locality, so a thread that accesses a[i] might spend less time accessing a[i+1].
I’m not very clear about CAS-RAS, and the internals of the GRAM are not fully exposed. I heard people guessing that there’s an unexposed cache for device memory, but I’ll test it myself. If it’s true, then threads may be able to make use of this locality. Thanks.
