Has anyone written a cache manager to do implicit caching for shared memory?

hi all

For a school project I am implementing a DSP algorithm in CUDA. At this point I have the algorithm working, but the program does not use shared memory; in other words, it reads and writes data directly in global memory.

Now, I understand that there is no implicit caching for global memory, so if I want to exploit data access locality by caching data in shared memory, I will have to write custom code to manage the cache myself.

So I am wondering: has anyone written such a cache manager? It could be something that mimics the cache controller of a CPU, where every memory access goes through the cache manager; the manager checks whether the data is in the cache, returns it directly on a hit, and otherwise handles the miss automatically by loading the data from global memory. Obviously there are other things to consider, such as conflict management and so on.
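
To make it concrete, here is roughly the kind of thing I have in mind. This is only a naive sketch I made up (the struct name, line count, and per-thread layout are placeholders, not a real library); giving every thread its own private lines sidesteps races between threads, though it obviously wastes shared memory:

```
// Naive sketch of a per-thread, direct-mapped software cache living in
// shared memory. All names and sizes are placeholders, not a real library.
#define LINES_PER_THREAD 4
#define THREADS_PER_BLOCK 128

struct ThreadCache {
    int   tag[LINES_PER_THREAD];   // global index held by each line, -1 = empty
    float val[LINES_PER_THREAD];   // cached value for that line
};

__device__ float cachedRead(ThreadCache *c, const float *gmem, int idx)
{
    int line = idx % LINES_PER_THREAD;   // direct-mapped placement
    if (c->tag[line] != idx) {           // miss: fetch from global memory
        c->val[line] = gmem[idx];
        c->tag[line] = idx;
    }
    return c->val[line];                 // hit, or the line we just filled
}

__global__ void kernelWithCache(const float *in, float *out, int n)
{
    __shared__ ThreadCache cache[THREADS_PER_BLOCK];
    ThreadCache *mine = &cache[threadIdx.x];
    for (int l = 0; l < LINES_PER_THREAD; ++l)
        mine->tag[l] = -1;               // start with an empty cache

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = cachedRead(mine, in, i) + cachedRead(mine, in, i); // second read is a hit
}
```

A real manager would of course share lines between the threads of a block, which is exactly where the conflict management I mentioned comes in.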

Obviously it's not hard to write one myself, but it seems like a pretty common task for a lot of CUDA projects, so I thought I should ask before I go reinvent the wheel.

This sounds interesting. I don't know if anyone has ever done this, but probably yes.

But I'm guessing that most people write their own customized shared memory code. It's a lot easier (if your algorithm allows it) to just fetch the data you work on into shared memory, perform the calculations, and write everything back. Because memory access is predictable for many algorithms, there is often no need for general-purpose cache management.

I could also imagine that there are major pitfalls in writing a cache manager. You would have a separate cache for every block, and if those caches overlap (i.e. different blocks cache the same global data) that could cause problems. You would probably have to synchronize grid-wide, which is no simple task (as discussed many times).

Well, yes, you always need to write 'cache management' for your specific algorithm. Implementing a generic cache manager efficiently is probably very hard, or even impossible, given the inherent parallelism of CUDA.

And it's always a win to manage the shared memory yourself; it's a very limited resource, so you don't want to fill it up with all kinds of cache management bookkeeping.

I agree with you that writing application-specific cache management will yield better optimizations, but having an application-agnostic cache manager can make an algorithm run 'fast enough', which is sufficient for many applications. The biggest benefit is that it saves the effort of having to write one for every algorithm. This is essentially the approach CPUs take, so I think it would be very useful.

As for the bookkeeping, it doesn't take up that much space; I would be more than happy to give up 1 KB of shared memory to have an automatic cache manager.

I think the reason no such cache manager exists is that for most kernels, "caching" is just a simple loop at the beginning of the kernel that copies a block of data from global memory to shared memory, followed by __syncthreads(). In that sense it isn't a traditional cache-on-demand system, but rather a pre-filled cache. This approach suits the CUDA architecture better, because the only way to get good bandwidth from global memory is to make coalesced reads, which requires the coordination of many threads in a block.
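
A minimal sketch of what I mean (the tile size and the element-wise work are made up; the structure is what matters):

```
#define TILE 256   // one block stages TILE elements; launch with blockDim.x == TILE

__global__ void tiledKernel(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];

    int i = blockIdx.x * TILE + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];  // neighbouring threads read neighbouring addresses: coalesced
    __syncthreads();                // the whole tile is now "cached" in shared memory

    if (i < n) {
        // From here on, reads hit shared memory. A real kernel would reuse
        // neighbouring elements of the tile, which is where the benefit shows up.
        float v = tile[threadIdx.x];
        out[i] = v * v;
    }
}
```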

A shared memory cache that is managed on the fly as you read from global memory is likely to have horrible performance no matter how you do it. Better to figure out how to put your data into a texture and let the hardware cache on the texture units do the work.
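
For a 1D array it can be as simple as binding the array to a texture and reading it through tex1Dfetch(). A rough sketch using the texture reference API (names are placeholders, error checking omitted):

```
// Reads go through the texture cache instead of plain global loads.
texture<float, 1, cudaReadModeElementType> inTex;

__global__ void texKernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * tex1Dfetch(inTex, i);  // fetch is served by the texture cache
}

// Host side, before the launch:
//   cudaBindTexture(0, inTex, d_in, n * sizeof(float));
//   texKernel<<<gridSize, blockSize>>>(d_out, n);
//   cudaUnbindTexture(inTex);
```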

But if you are satisfied when your algorithm is just 'fast enough', what's the point of implementing it on graphics hardware in the first place? It seems most people using CUDA want to get the most out of the architecture, and memory read optimization is by far the most important way to get speed.