2D spatial locality for L2 cache on Fermi

I currently use a 9500GT and have a program (ray-tracer) which uses lots of semi-random accesses to global device memory. I’ve recently been happy with the 2-3x speedup that I’ve gained by moving the (too-big-for-shared) memory from global to texture (using a cudaArray and then tex2D() to access the data). By “semi-random” I mean reads which are usually ‘close’ to the last read, but not always.

First off, I’m assuming it is just the single last global ‘read’ which is taken as the basis for the cache’s choice of what surrounding area to cache (for example, if (x, y) is read, then roughly the area from (x-10, y-10) to (x+10, y+10) is cached).

Obviously though, the speedups just leave you wanting yet more ;)

Apparently, Fermi uses an L2 cache which seems like a perfect replacement for the texture cache, since it’s a massive 768k (about one or two orders of magnitude larger than the texture cache). However, the showstopper is this: the L2 cache doesn’t seem to support 2D spatial locality, according to seibert in this thread.

So my question: I was wondering if there was any way around this. Maybe use some kind of Hilbert space-filling curve? Has anyone tried that with the L2 cache? If so, is the speedup worth implementing compared to using texture memory?
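For reference, the Z-order (Morton) curve is the simpler cousin of the Hilbert curve, and its index math is just bit interleaving, so it’s cheap enough to do per access. Here’s a minimal host-side sketch of the mapping (plain C, not tied to any CUDA API):

```c
#include <stdint.h>

/* Spread the lower 16 bits of x into the even bit positions. */
static uint32_t part1by1(uint32_t x) {
    x &= 0x0000FFFF;
    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;
    return x;
}

/* Morton (Z-order) index: interleave the bits of x and y, so points
   that are close in 2D usually land at nearby linear addresses. */
uint32_t morton2d(uint32_t x, uint32_t y) {
    return part1by1(x) | (part1by1(y) << 1);
}
```

The first four texels come out in the familiar Z pattern: (0,0)→0, (1,0)→1, (0,1)→2, (1,1)→3.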

Are Kepler and Maxwell planned to support (the option of) 2D spatial locality for the L2 cache?

I’m quite sure that it’s not the cache that gets optimized for 2D spatial locality, but the memory layout in the cudaArray. After all, how would you make the cache local in 2D for different strides at the same time?

Thus, I don’t think there’s anything that can be done in hardware. Instead, you can probably get the optimization for 2D locality on current hardware already if you replace the texture references with surfaces rather than direct memory reads.

Oh right, so the Z-order/Hilbert curve stuff used for cudaArray is software side only, and the actual physical memory is a simple ordered linear layout? That changes things slightly!

If this is true, then why isn’t allocating global memory in a Z-order layout already possible for the global-memory-with-L2-cache approach on Fermi? Maybe a future version of CUDA will support this on the software side?

Obviously, one could always implement it manually, and if so I’d still love to hear how that compares, going through the L2 cache, versus using the automated texture-memory/cudaArray approach.
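For what it’s worth, a manual version doesn’t have to be a full Morton curve; even a simple tiled layout (the image split into small square tiles stored contiguously) gives most of the 2D locality. A host-side sketch of the index math, assuming the width is a multiple of the tile size:

```c
#include <stddef.h>

#define TILE 16

/* Map a 2D coordinate into a tiled ("block-linear") layout: the image
   is split into TILE x TILE tiles stored contiguously, so a small 2D
   neighbourhood occupies one contiguous run of memory instead of TILE
   widely separated rows. width is assumed to be a multiple of TILE. */
size_t tiled_index(size_t x, size_t y, size_t width) {
    size_t tiles_per_row = width / TILE;
    size_t tile = (y / TILE) * tiles_per_row + (x / TILE);
    return tile * TILE * TILE + (y % TILE) * TILE + (x % TILE);
}
```

With TILE = 16, a 16x16 neighbourhood then sits in one contiguous 256-element run, which is the property you want the L2 cache lines to exploit.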

As for these surfaces you speak of, will that approach use the L2 cache and do the grunt work of a Z-order curve automatically, similar to the cudaArray/texture-memory approach? If so, semi-disregard my second paragraph.

I assume it gets some help from the hardware there, just because the address calculations are so much simpler/faster to do in hardware. But that’s all hidden behind the cudaArrays, because apparently Nvidia believes they’ve done something really clever there that the competition should not find out about.

I’m surprised no one has reverse engineered that stuff yet. It should be really easy, unless the driver makes use of the MMU to prevent direct access to the memory underlying the cudaArray. Then again, there’s probably not much incentive to find out. Certainly I’ve got better things to do.

Isn’t it possible right now along the lines I indicated above? Haven’t tried it myself, though.

I’d think so. But again, I haven’t tried myself.

Thanks. When I get Fermi/Kepler, I look forward to trying it out.

Just to clarify: when you use the texture sampling functions, both the texture cache and the L2 cache are used, and as long as you use a 2D CUDA array, both of these will automatically be optimised for 2D locality because of the layout of the CUDA array. You would only want to implement this manually if you were not using the texture sampling functions for some reason (either because of their limited throughput or because you need to do writes instead of reads). At least that’s my understanding.

Edit: Even then, I think tera is right in saying that surfaces will do all this for you. In other words, surfaces are like textures except that they are read/write, don’t support filtering or normalised coordinates, use the L1 cache rather than the texture cache, and don’t suffer the same limited throughput as the texturing units.

Thanks that’s interesting too.

Do you mean the L2 cache as well/instead? It’s the giant 768k of the L2 I’m interested in; the L1 cache is the tiny 64k one, AFAIK.

I mean that surfaces use the L1 cache + L2 cache, and textures use the texture cache + L2 cache. Also bear in mind that although the L2 cache is bigger, it is shared between all multiprocessors, so to get much benefit from its larger size you will need to be very careful that all the multiprocessors are accessing nearby regions of memory. Probably the best you can do is to make sure that consecutive thread blocks access nearby regions of memory.
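One way to do that is to remap the linear block index so that consecutive indices cover a small square of tiles before moving on. The function below is just a hypothetical host-side sketch of the index math (the names swizzle, GROUP, and grid_w are mine, not from any CUDA API); in a kernel you would derive the tile coordinates from blockIdx.x the same way. grid_w is assumed to be a multiple of GROUP.

```c
#define GROUP 4

/* Remap a linear block index so that consecutive indices walk a
   GROUP x GROUP square of tiles before moving to the next square,
   keeping blocks that run at roughly the same time in nearby regions
   of the image. */
void swizzle(int linear, int grid_w, int *bx, int *by) {
    int per_group      = GROUP * GROUP;
    int groups_per_row = grid_w / GROUP;
    int g = linear / per_group;   /* which GROUP x GROUP square */
    int r = linear % per_group;   /* position inside that square */
    *bx = (g % groups_per_row) * GROUP + r % GROUP;
    *by = (g / groups_per_row) * GROUP + r / GROUP;
}
```

For example, with an 8-tile-wide grid, indices 0..15 stay inside the top-left 4x4 square of tiles before index 16 moves on to the next square.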

Right, thanks.