How to optimize for cache + shared memory on Fermi?

BlahCuda · April 22, 2010, 9:40pm

Forgive me if I’m mistaken, but one difference between L1+L2 cache in Fermi and shared memory is that the former is managed automatically whereas the latter is user-managed. Assuming what I’m saying is true, how can CUDA developers judiciously utilize the shared memory given that the behavior of the cache is largely unknown? Would we need to resort to a lot of trial and error?

jjp · April 22, 2010, 10:02pm

Fermi still has shared memory just like the previous GPUs, and even though there is an additional L1/L2 cache it will still be vital to utilize the shared memory for re-using local data.

BlahCuda · April 22, 2010, 10:09pm

Right. But how can we judiciously utilize shared memory if the behavior of the L1/L2 cache is unclear? For example, wouldn’t there be instances where data which is already inside the L1/L2 cache is redundantly put into shared memory by the developer?

seibert · April 22, 2010, 10:13pm

We really just need some kind of easy cache-control in CUDA C. If we could mark some global arrays as being uncacheable (because we only use the value once, or we do our own caching in the shared memory), I think that would cover most things. Then the only decision you have to make is “Will this data make good use of the cache?” and let the hardware do the rest.

jjp · April 22, 2010, 10:20pm

Indeed, that is an interesting question. My guess is that shared memory and registers still have a lower access latency. It will still be advantageous to put stuff into shared memory for data that is either frequently re-used or that is not accessed in a pattern that is cache-friendly.
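
To make the "frequently re-used" case concrete, here is a minimal sketch of staging data in shared memory, using a 1D stencil as the example. The kernel name `stencil1d`, the constants `RADIUS` and `BLOCK`, and the test size are all made up for illustration, and the kernel assumes it is launched with exactly `BLOCK` threads per block. Whether this beats relying on L1 for a given access pattern still has to be measured.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define RADIUS 3     // hypothetical stencil radius
#define BLOCK  256   // threads per block; the kernel assumes blockDim.x == BLOCK

// Each output is the sum of 2*RADIUS+1 neighbouring inputs, so every input
// element is read by several threads of the same block.
__global__ void stencil1d(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + 2 * RADIUS];

    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + RADIUS;                   // index inside the tile

    // Stage this block's elements once, zero-padding past the end of the array.
    tile[l] = (g < n) ? in[g] : 0.0f;

    // The first RADIUS threads also load the left and right halo elements.
    if (threadIdx.x < RADIUS) {
        int left  = g - RADIUS;
        int right = g + BLOCK;
        tile[l - RADIUS] = (left >= 0) ? in[left]  : 0.0f;
        tile[l + BLOCK]  = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();  // the whole tile must be loaded before anyone reads it

    if (g < n) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += tile[l + k];  // re-use comes from shared memory, not global
        out[g] = sum;
    }
}

int main()
{
    const int n = 1000;  // deliberately not a multiple of BLOCK
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    stencil1d<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Print one boundary and one interior value for a quick sanity check.
    printf("out[0] = %.1f, out[%d] = %.1f\n", h_out[0], n / 2, h_out[n / 2]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Each input element is fetched from global memory once per block (plus the halo) and then read up to 2*RADIUS+1 times from the tile, so the re-use is explicit and does not depend on what the cache happens to keep.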

Gregory_Diamos · April 22, 2010, 11:24pm

There is ISA support for this in PTX 2.0. All someone would need to make this happen would be to either write in assembly (and get it working now), or extend nvcc to add an intrinsic for uncached accesses.

seibert · April 23, 2010, 1:28am

Ah right, I forgot that there is (official/unofficial?) support for inline assembly. That might be the best approach for now to bypass the cache.
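
For reference, a rough sketch of what the inline-assembly route could look like. It uses the PTX 2.0 `.cg` cache operator on a global load, which caches the line in L2 only and skips L1. The helper name `load_bypass_l1` and the kernel `scale_streaming` are invented for this example, and the `"l"` pointer constraint assumes a 64-bit build (a 32-bit build would need `"r"`). Treat it as an illustration to experiment with on sm_20, not a recommended pattern.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: load one float with the .cg operator
// (cache at the global/L2 level, do not allocate the line in L1).
__device__ __forceinline__ float load_bypass_l1(const float *p)
{
    float v;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

// Streaming kernel: every input value is used exactly once, so there is
// little reason to let it occupy L1.
__global__ void scale_streaming(const float *in, float *out, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * load_bypass_l1(in + i);
}

int main()
{
    const int n = 1024;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    scale_streaming<<<(n + 255) / 256, 256>>>(d_in, d_out, 2.0f, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Spot-check one element on the host.
    printf("out[10] = %.1f\n", h_out[10]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Compile for Fermi or later (for example `nvcc -arch=sm_20` with a toolkit that still supports it), since the cache operators are not available on earlier targets.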

mikola · April 25, 2010, 1:21am

Does anyone know whether the L1 cache serves only its own SM, or other SMs as well?

If it serves only its own SM, then to use it effectively, blocks that access the same memory should run on the same SM.

tmurray · April 25, 2010, 6:23am

L1 is per-SM, L2 is across the entire chip.
