global memory caching

Hello,

I have a question regarding caching. I read in the programming guide that global memory accesses are not cached.

Say I have N blocks with K threads each, and an array of size K in global memory that I want every block to work from. If, in my kernel, each thread reads its corresponding element from that array into shared memory, and this happens across all blocks, will there be a large performance hit (i.e. no broadcast-type optimization like with simultaneous shared memory bank reads, and no caching)?
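To make this concrete, here is roughly the pattern I have in mind (the kernel and variable names are just made up for illustration, with K equal to the block size):

    #define K 256  // size of the global array, also used as the block size here

    __global__ void stageAndCompute(const float *globalArray, float *out)
    {
        __shared__ float tile[K];

        // Each thread copies its corresponding element of the K-element array
        // from global memory into shared memory; every block reads the same data.
        tile[threadIdx.x] = globalArray[threadIdx.x];
        __syncthreads();

        // ... all further reads then come from shared memory ...
        out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x] * 2.0f;
    }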

Thanks

So I think I’ve answered my own question. The problem was that I was reading an old version of the CUDA C programming guide which dealt only with compute capabilities before 2.x (doh!). I’ll include what I’ve found here (if there is an error please correct me).

Global memory reads are cached on devices of compute capability 2.x. There is an L1 cache per multiprocessor (on-chip, in the same physical memory as shared memory) and an L2 cache shared by all multiprocessors; both cache global and local memory accesses. Memory requests are broken down into cache-line requests, with hits serviced at the respective L1 or L2 throughput and misses serviced at device memory throughput.
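As a side note, a quick way to check at runtime which case applies is to query the device's compute capability; something like this sketch (using device 0) should do it:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // query device 0

        printf("Compute capability: %d.%d\n", prop.major, prop.minor);
        if (prop.major >= 2)
            printf("Global memory reads are cached on this device.\n");
        else
            printf("Global memory reads are not cached on this device.\n");
        return 0;
    }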

Is this a reasonably correct view?

Thanks

Reference: CUDA C Programming Guide

Yes, that seems correct.

I thought shared memory was a cache for global memory.
Where can I find the cache sizes for a specific GPU model?

Physically, the L1 cache and the shared memory are the same: each streaming multiprocessor has 64 KB of on-chip memory split between the two. By default 48 KB is used as shared memory and 16 KB as L1 cache, and you can switch that at runtime (per kernel with cudaFuncSetCacheConfig, or per device with cudaDeviceSetCacheConfig) to 48 KB of L1 cache and 16 KB of shared memory.
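For example, something along these lines selects the split for a particular kernel (myKernel, the launch configuration, and the sizes here are just placeholders):

    #include <cuda_runtime.h>

    __global__ void myKernel(float *data) { /* ... */ }

    int main()
    {
        // Prefer a 48 KB L1 / 16 KB shared memory split for this kernel;
        // cudaFuncCachePreferShared would request 16 KB L1 / 48 KB shared instead.
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

        float *d_data;
        cudaMalloc(&d_data, 1024 * sizeof(float));
        myKernel<<<4, 256>>>(d_data);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }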

In addition to this there is the L2 cache, which is available to all streaming multiprocessors. (I hope I did not mix up the L1 and L2 caches.)
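Regarding the question above about where to find the cache sizes: the runtime API can report them for the installed card, roughly like this (a sketch for device 0; sizes come back in bytes):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // query device 0

        printf("Device:                  %s\n", prop.name);
        printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("L2 cache size:           %d bytes\n", prop.l2CacheSize);
        return 0;
    }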

This is specific to the Fermi architecture. Every new architecture brings new features.