Is global memory access cached, at least a little?

Some say global memory is transferred in units called “blocks”.
I understand that a coalesced access pattern can batch transfers into a single large one and thus reduce the per-transfer overhead, because cudaMemcpy is block-wise.
But when an individual thread makes a non-coalesced access, does it fetch a whole block, or exactly the data it asked for, each time it accesses global memory?

It fetches just what it requires; there is no caching for global memory.

Thanks. Could anyone from NVIDIA shed light on the size of a physical block of global memory? If the basic transfer unit is a byte, I’m afraid the hardware addressing matrix would be too large.
The detailed information isn’t fully exposed yet, is it? Thanks.

The CUDA programming guide says “the device is capable of reading 32-bit, 64-bit, or 128-bit words from global memory into registers in a single instruction”.
Accesses of all three sizes can be coalesced over a half-warp of 16 threads, so the combined accesses are 512, 1024, or 2048 bits wide, respectively.

This means that the memory controller can be asked to fetch blocks ranging from 4 bytes (a single 32-bit word) to 256 bytes (a coalesced half-warp of 128-bit words), provided each block is aligned to its own size.
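To make the arithmetic concrete, here is a small host-side C++ sketch (no GPU needed). It only reproduces the numbers quoted above: 16 threads per half-warp, times a 4-, 8-, or 16-byte word. The names `kHalfWarpThreads`, `coalescedBytes` and `isAligned` are made up for this illustration and are not part of the CUDA API, and the alignment check is just the "aligned to its own size" condition as I read it, not a statement of what the hardware actually does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption taken from the post above: coalescing is evaluated
// per half-warp of 16 threads.
const std::size_t kHalfWarpThreads = 16;

// Bytes moved when every thread of a half-warp reads one word of
// wordBytes bytes (4, 8 or 16) and the reads are coalesced.
inline std::size_t coalescedBytes(std::size_t wordBytes) {
    // Only the three word sizes named in the programming guide quote.
    assert(wordBytes == 4 || wordBytes == 8 || wordBytes == 16);
    return kHalfWarpThreads * wordBytes;
}

// True if a block of blockBytes bytes starting at `address` begins
// on a multiple of its own size.
inline bool isAligned(std::uint64_t address, std::size_t blockBytes) {
    return blockBytes != 0 && address % blockBytes == 0;
}
```

So `coalescedBytes(4)`, `coalescedBytes(8)` and `coalescedBytes(16)` give 64, 128 and 256 bytes, which are the 512-, 1024- and 2048-bit figures. A 64-byte block starting at byte offset 256 passes `isAligned`, while one starting at offset 260 does not.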

I don’t know what you mean by “physical block”, though.

It’s only my guess. The memory has a CAS-RAS matrix in its addressing hardware. If the basic addressing unit is a byte, then this matrix is fine-grained and too big. If the basic addressing unit is a bigger block, that implies locality, so a thread that has accessed a[i] might spend less time accessing a[i+1].
I’m not very clear about CAS-RAS, and the internals of the GRAM are not fully exposed. I’ve heard people guess that there’s an unexposed cache for device memory, but I’ll test it myself. If it’s true, then threads may be able to take advantage of this locality. Thanks.