global memory caching


I have a question regarding caching. I read in the programming guide that global memory access is not cached.

So suppose I have N blocks with K threads each, and an array of size K in global memory that I want every block to work from. If, in my kernel, each thread reads its corresponding element from the array into shared memory (and this happens across all blocks), will there be a large performance hit, given that there is no broadcast-type optimization (as with simultaneous shared memory bank reads) and no caching?
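For reference, a minimal sketch of the staging pattern described above: each thread loads one element of a K-sized global array into shared memory, and the block then works on the shared copy. The names (`K`, `staged`, `stage_and_work`) and the placeholder computation are made up for illustration.

```cuda
#include <cuda_runtime.h>

#define K 256  // assumed: block size equals the array size

__global__ void stage_and_work(const float *g_in, float *g_out)
{
    __shared__ float staged[K];

    int tid = threadIdx.x;

    // One global read per thread. On devices before compute capability 2.x
    // this read is not cached, but because consecutive threads read
    // consecutive addresses, the reads coalesce into few memory transactions.
    staged[tid] = g_in[tid];

    __syncthreads();  // make the staged copy visible to the whole block

    // ... work on staged[] in shared memory; placeholder computation:
    g_out[blockIdx.x * blockDim.x + tid] = staged[tid] * 2.0f;
}
```

Even without a cache, the cost here is one coalesced global read of the array per block, which is usually acceptable; the repeated accesses afterwards all hit shared memory.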


So I think I’ve answered my own question. The problem was that I was reading an old version of the CUDA C programming guide which dealt only with compute capabilities before 2.x (doh!). I’ll include what I’ve found here (if there is an error please correct me).

Global memory reads are cached on devices of compute capability 2.x. There is an L1 cache (one per multiprocessor, on-chip with the shared memory) and an L2 cache (shared by all multiprocessors), both of which apply to global and local memory. Memory requests are broken down into cache-line requests; hits are serviced at the respective L1 or L2 throughput, and misses at device memory throughput.

Is this a reasonably correct view?


Reference: CUDA C Programming Guide

Yes, that seems correct.

I thought shared memory was a cache for global memory.
Where can I find the cache sizes for a specific model of GPU?
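Besides the compute-capability tables in the programming guide, you can query the sizes at runtime. A minimal sketch: the `cudaDeviceProp` fields of interest are `l2CacheSize` (bytes; 0 on devices without an L2 cache) and `sharedMemPerBlock`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);

        printf("Device %d: %s (compute %d.%d)\n",
               d, prop.name, prop.major, prop.minor);
        printf("  L2 cache size:        %d bytes\n", prop.l2CacheSize);
        printf("  Shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
    }
    return 0;
}
```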

Physically, the L1 cache and the shared memory are the same on-chip memory. Each streaming multiprocessor has 64 KB available, split by default into 48 KB of shared memory and 16 KB of L1 cache. With a runtime API call (not a compile flag) you can change that to 48 KB of L1 cache and 16 KB of shared memory.

In addition to this there is the L2 cache, which is available to all streaming multiprocessors. (I hope I did not mix up the L1 and L2 caches.)

This is specific to the Fermi architecture. Every new architecture brings new features.