2D spatial locality for L2 cache on Fermi

I currently use a 9500GT and have a program (ray-tracer) which uses lots of semi-random accesses to global device memory. I’ve recently been happy with the 2-3x speedup that I’ve gained by moving the (too-big-for-shared) memory from global to texture (using a cudaArray and then tex2D() to access the data). By “semi-random” I mean reads which are usually ‘close’ to the last read, but not always.

First off, I’m assuming it is just the single last global ‘read’ which is taken as the basis for the cache’s choice of what surrounding area to cache (for example, if (x, y) is read, then roughly the area from (x-10, y-10) to (x+10, y+10) is cached).

Obviously though, the speedups just leave you wanting yet more ;)

Apparently, Fermi uses an L2 cache which seems like a perfect replacement for the texture cache, since it’s a massive 768k (about one or two orders of magnitude larger than the texture cache). However, the showstopper is this: the L2 cache doesn’t seem to support 2D spatial locality, according to seibert in this thread.

So my question: I was wondering if there was any way around this. Maybe use some kind of Hilbert space-filling curve? Has anyone tried that with the L2 cache? If so, is the speedup worth implementing compared to using texture memory?
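For reference, the Z-order (Morton) curve is the simpler cousin of the Hilbert curve, and its index math is just bit interleaving, so it’s cheap enough to do per access. Here’s a minimal host-side sketch of the mapping (plain C, not tied to any CUDA API):

```c
#include <stdint.h>

/* Spread the lower 16 bits of x into the even bit positions. */
static uint32_t part1by1(uint32_t x) {
    x &= 0x0000FFFF;
    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;
    return x;
}

/* Morton (Z-order) index: interleave the bits of x and y, so points
   that are close in 2D usually land at nearby linear addresses. */
uint32_t morton2d(uint32_t x, uint32_t y) {
    return part1by1(x) | (part1by1(y) << 1);
}
```

The first four texels come out in the familiar Z pattern: (0,0)→0, (1,0)→1, (0,1)→2, (1,1)→3.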

Are Kepler and Maxwell planned to support (the option of) 2D spatial locality for the L2 cache?

I’m quite sure that it’s not the cache that gets optimized for 2D spatial locality, but the memory layout in the cudaArray. After all, how would you make the cache local in 2D for different strides at the same time?

Thus, I don’t think there’s anything that can be done in hardware. Instead, you can probably get the optimization for 2D locality on current hardware already if you replace the texture references with surfaces rather than direct memory reads.

Oh right, so the Z-order/Hilbert curve stuff used for cudaArray is software side only, and the actual physical memory is a simple ordered linear layout? That changes things slightly!

If this is true, then why isn’t allocating global memory in a Z-order layout already possible for the global-memory-with-L2-cache approach on Fermi? Maybe a future version of CUDA will support this on the software side?

Obviously, one could always implement it manually, and if so I’d still love to hear how that compares, going through the L2 cache, versus using the automated texture-memory/cudaArray approach.
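For what it’s worth, a manual version doesn’t have to be a full Morton curve; even a simple tiled layout (the image split into small square tiles stored contiguously) gives most of the 2D locality. A host-side sketch of the index math, assuming the width is a multiple of the tile size:

```c
#include <stddef.h>

#define TILE 16

/* Map a 2D coordinate into a tiled ("block-linear") layout: the image
   is split into TILE x TILE tiles stored contiguously, so a small 2D
   neighbourhood occupies one contiguous run of memory instead of TILE
   widely separated rows. width is assumed to be a multiple of TILE. */
size_t tiled_index(size_t x, size_t y, size_t width) {
    size_t tiles_per_row = width / TILE;
    size_t tile = (y / TILE) * tiles_per_row + (x / TILE);
    return tile * TILE * TILE + (y % TILE) * TILE + (x % TILE);
}
```

With TILE = 16, a 16x16 neighbourhood then sits in one contiguous 256-element run, which is the property you want the L2 cache lines to exploit.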

As for these surfaces you speak of, will that approach use the L2 cache and do the grunt work of a Z-order curve automatically, similar to the cudaArray/texture-memory approach? If so, semi-disregard my second paragraph.

I assume it gets some help from the hardware there, just because the address calculations are so much simpler/faster to do in hardware. But that’s all hidden behind the cudaArrays, because apparently Nvidia believes they’ve done something really clever there that the competition should not find out about.

I’m surprised no one has reverse engineered that stuff yet. It should be really easy, unless the driver makes use of the MMU to prevent direct access to the memory underlying the cudaArray. Then again, there’s probably not much incentive to find out. Certainly I’ve got better things to do.

Isn’t it possible right now along the lines I indicated above? Haven’t tried it myself, though.

I’d think so. But again, I haven’t tried myself.

Thanks. When I get Fermi/Kepler, I look forward to trying it out.

Just to clarify: when you use the texture sampling functions, both the texture cache and the L2 cache are used, and as long as you use a 2D CUDA array, both of these will automatically be optimised for 2D locality because of the layout of the CUDA array. You would only want to implement this manually if you were not using the texture sampling functions for some reason (either because of their limited throughput or because you need to do writes instead of reads). At least that’s my understanding.

Edit: Even then, I think tera is right in saying that surfaces will do all this for you. In other words, surfaces are like textures except that they are read/write, don’t support filtering or normalised coordinates, use the L1 cache rather than the texture cache, and don’t suffer the same limited throughput as the texturing units.

Thanks that’s interesting too.

Do you mean the L2 cache as well/instead? It’s the giant 768k of the L2 I’m interested in; the L1 cache is the tiny 64k one, AFAIK.

I mean that surfaces use the L1 cache + L2 cache, and textures use the texture cache + L2 cache. Also bear in mind that although the L2 cache is bigger, it is shared between all multiprocessors, so to get much benefit from its larger size you will need to be very careful that all the multiprocessors are accessing nearby regions of memory. Probably the best you can do is to make sure that consecutive thread blocks access nearby regions of memory.
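One way to do that is to remap the linear block index so that consecutive indices cover a small square of tiles before moving on. The function below is just a hypothetical host-side sketch of the index math (the names swizzle, GROUP, and grid_w are mine, not from any CUDA API); in a kernel you would derive the tile coordinates from blockIdx.x the same way. grid_w is assumed to be a multiple of GROUP.

```c
#define GROUP 4

/* Remap a linear block index so that consecutive indices walk a
   GROUP x GROUP square of tiles before moving to the next square,
   keeping blocks that run at roughly the same time in nearby regions
   of the image. */
void swizzle(int linear, int grid_w, int *bx, int *by) {
    int per_group      = GROUP * GROUP;
    int groups_per_row = grid_w / GROUP;
    int g = linear / per_group;   /* which GROUP x GROUP square */
    int r = linear % per_group;   /* position inside that square */
    *bx = (g % groups_per_row) * GROUP + r % GROUP;
    *by = (g / groups_per_row) * GROUP + r / GROUP;
}
```

For example, with an 8-tile-wide grid, indices 0..15 stay inside the top-left 4x4 square of tiles before index 16 moves on to the next square.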

Right, thanks.