I currently use a 9500GT and have a program (a ray tracer) which does lots of semi-random reads from global device memory. I've recently been happy with the 2-3x speedup I gained by moving the (too-big-for-shared-memory) data from global to texture memory (using a cudaArray and then tex2D() to access it). By "semi-random" I mean reads which are usually 'close' to the last read, but not always.
First off, I'm assuming it's just the single most recent global read that's used as the basis for the cache's choice of what surrounding area to cache. For example, if (x, y) is read, then roughly the region from (x-10, y-10) to (x+10, y+10) is cached.
Obviously, though, speedups like that just leave you wanting yet more.
Apparently, Fermi has an L2 cache which seems like a perfect replacement for the texture cache, since it's a massive 768 KB (one to two orders of magnitude larger than the texture cache). However, the showstopper is this: according to seibert in this thread, the L2 cache doesn't support 2D spatial locality.
So my question: is there any way around this? Maybe using some kind of Hilbert space-filling curve? Has anyone tried that with the L2 cache? If so, was the speedup worth implementing compared to using texture memory?
Are Kepler and Maxwell planned to offer (the option of) 2D spatially local caching in the L2 cache?