Has anyone written a cache manager? to do implicit caching for shared memory

zl25drexel · December 10, 2007, 3:56am

hi all

For a school project I am implementing a DSP algorithm in CUDA. At this point I have the algorithm working, but the program does not use shared memory. In other words, it reads and writes data directly to global memory.

Now, I understand that there’s no implicit caching for global memory, so if I want to exploit data access locality by caching data in shared memory, I will have to write custom code to manage the cache.

So I am wondering, has anyone written such a cache manager? It could be something that mimics the cache controller of a CPU: every memory access goes through the cache manager, and the manager checks whether the data is in the cache. If it is, it just returns the cached data; otherwise it automatically handles the miss by loading the data from global memory. Obviously there are other things to consider, such as conflict management and whatnot.

Obviously it’s not hard to write one myself, but it seems to be a pretty common task for a lot of CUDA projects, so I thought I should ask before I go reinvent the wheel.

seb · December 10, 2007, 4:24am

This sounds interesting. I do not know if anyone ever did this - probably yes.

But I’m guessing that most people write their own customized shared memory code. It’s a lot easier (if your algorithm allows it) to just fetch the data you work on into shared memory, perform the calculations, and write everything back. Because memory access is predictable for many algorithms, there is often no need for general-purpose cache management.

I could also imagine that there are major pitfalls when writing a cache manager. One would have a cache for every block. If those caches overlap, this could cause problems. One would probably have to synchronize grid-wide, which is no simple task (as discussed many times).

wumpus · December 10, 2007, 2:18pm

Well, yes, you always need to write ‘cache management’ for your specific algorithm. Implementing a feasible generic cache manager is probably very hard or even impossible given the inherent parallelism of CUDA.

And it’s always a win to manage the shared memory yourself. It’s a very limited resource, so you don’t want to fill it up with all kinds of cache-management bookkeeping.

zl25drexel · December 10, 2007, 4:06pm

I agree with you that writing application-specific cache management will yield better optimizations. But an application-agnostic cache manager can make an algorithm run ‘fast enough’, which is sufficient for many applications. The biggest benefit is that it saves the effort of having to write one for every algorithm. This is essentially the approach CPUs take, and I think it would be very useful.

As for the bookkeeping, it doesn’t take up that much space. I would be more than happy to give up 1k of shared memory to have an automatic cache manager.

seibert · December 10, 2007, 4:32pm

I think the reason no such cache manager exists is that for most kernels, “caching” is just a simple loop at the beginning of the kernel which copies a block of data from global memory to shared memory, followed by __syncthreads(). In that sense, it isn’t a traditional “cache-on-demand” system, but rather a pre-filled cache. This approach is better for the CUDA architecture because the only way to get good bandwidth to global memory is to make coalesced reads, which involves the coordination of many threads in a block.
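For illustration, here is a minimal sketch of the "pre-filled cache" pattern described above. The moving-average filter, the names (movingAverage, runMovingAverage, TILE, RADIUS) and the tile size are assumptions made up for this example, not code from the thread. Each block first copies its tile plus a small halo from global to shared memory with consecutive threads reading consecutive addresses, calls __syncthreads(), and only then computes from the shared copy.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int TILE   = 256;  // assumed: threads per block == outputs per block
constexpr int RADIUS = 2;    // assumed: filter half-width (5-point average)

// Hypothetical kernel: must be launched with blockDim.x == TILE.
__global__ void movingAverage(const float* in, float* out, int n)
{
    __shared__ float tile[TILE + 2 * RADIUS];

    int blockStart = blockIdx.x * TILE;

    // Pre-fill: thread t loads elements t, t + blockDim.x, ... of the tile,
    // so neighbouring threads touch neighbouring global addresses.
    for (int i = threadIdx.x; i < TILE + 2 * RADIUS; i += blockDim.x) {
        int g = blockStart + i - RADIUS;               // global index incl. halo
        tile[i] = (g >= 0 && g < n) ? in[g] : 0.0f;    // zero-pad outside the array
    }
    __syncthreads();  // the whole tile must be loaded before anyone reads it

    int g = blockStart + threadIdx.x;
    if (g < n) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += tile[threadIdx.x + RADIUS + k];     // all reads hit shared memory
        out[g] = sum / (2 * RADIUS + 1);
    }
}

// Hypothetical host wrapper: copies data over, runs the kernel, copies back.
std::vector<float> runMovingAverage(const std::vector<float>& h_in)
{
    int n = static_cast<int>(h_in.size());
    std::vector<float> h_out(n);
    if (n == 0) return h_out;

    float *d_in = nullptr, *d_out = nullptr;
    assert(cudaMalloc(&d_in,  n * sizeof(float)) == cudaSuccess);
    assert(cudaMalloc(&d_out, n * sizeof(float)) == cudaSuccess);
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + TILE - 1) / TILE;
    movingAverage<<<blocks, TILE>>>(d_in, d_out, n);
    assert(cudaGetLastError() == cudaSuccess);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling runMovingAverage on a host vector returns the filtered vector. The only "cache management" here is the load loop and the barrier; there are no tags or miss handling, which is why this pattern costs no extra shared memory for bookkeeping.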

A shared memory cache that is managed on-the-fly as you read from global memory is likely to have horrible performance no matter which way you do it. Better to figure out how to put your data into a texture and let the hardware cache on the texture units do the work.
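As a rough sketch of the texture route, the example below reads a linear device buffer through a texture so that irregular reads go through the texture cache. It uses the texture object API (cudaCreateTextureObject, tex1Dfetch), which is newer than this thread and needs a device that supports texture objects; that choice, the gather access pattern, and the names gatherKernel and gatherThroughTexture are all assumptions for illustration.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Hypothetical kernel: out[i] = data[idx[i]], with data read through a texture.
__global__ void gatherKernel(cudaTextureObject_t tex, const int* idx,
                             float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch<float>(tex, idx[i]);  // fetch goes via the texture cache
}

// Hypothetical host wrapper: binds `data` to a texture object and gathers from it.
// Every entry of `idx` is assumed to be a valid index into `data`.
std::vector<float> gatherThroughTexture(const std::vector<float>& data,
                                        const std::vector<int>& idx)
{
    int n = static_cast<int>(idx.size());
    std::vector<float> result(n);
    if (n == 0 || data.empty()) return result;

    float* d_data = nullptr;
    int*   d_idx  = nullptr;
    float* d_out  = nullptr;
    assert(cudaMalloc(&d_data, data.size() * sizeof(float)) == cudaSuccess);
    assert(cudaMalloc(&d_idx,  n * sizeof(int))             == cudaSuccess);
    assert(cudaMalloc(&d_out,  n * sizeof(float))           == cudaSuccess);
    cudaMemcpy(d_data, data.data(), data.size() * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx, idx.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Describe the linear buffer that backs the texture.
    cudaResourceDesc resDesc = {};
    resDesc.resType                = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr      = d_data;
    resDesc.res.linear.desc        = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = data.size() * sizeof(float);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;   // return raw float values

    cudaTextureObject_t tex = 0;
    assert(cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr) == cudaSuccess);

    int threads = 128;
    int blocks  = (n + threads - 1) / threads;
    gatherKernel<<<blocks, threads>>>(tex, d_idx, d_out, n);
    assert(cudaGetLastError() == cudaSuccess);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(result.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFree(d_data);
    cudaFree(d_idx);
    cudaFree(d_out);
    return result;
}
```

The kernel contains no caching logic at all; whether this beats a hand-written shared memory tile depends on the access pattern and the hardware, so it is worth measuring both.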

wumpus · December 10, 2007, 11:48pm

But if you are satisfied when your algorithm is ‘fast enough’, what’s the use of implementing it on graphics hardware in the first place? It seems most people using CUDA want to get the most out of the architecture, and memory reading optimization is by far the most important way to get speed.
