How to optimize for cache + shared memory on Fermi?

Forgive me if I’m mistaken, but one difference between L1+L2 cache in Fermi and shared memory is that the former is managed automatically whereas the latter is user-managed. Assuming what I’m saying is true, how can CUDA developers judiciously utilize the shared memory given that the behavior of the cache is largely unknown? Would we need to resort to a lot of trial and error?

Fermi still has shared memory just like the previous GPUs, and even though there is an additional L1/L2 cache it will still be vital to utilize the shared memory for re-using local data.

Right. But how can we judiciously utilize shared memory if the behavior of the L1/L2 cache is unclear? For example, wouldn’t there be instances where data which is already inside the L1/L2 cache is redundantly put into shared memory by the developer?

We really just need some kind of easy cache-control in CUDA C. If we could mark some global arrays as being uncacheable (because we only use the value once, or we do our own caching in the shared memory), I think that would cover most things. Then the only decision you have to make is “Will this data make good use of the cache?” and let the hardware do the rest.
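For a coarse, whole-file version of this, nvcc (Fermi-era toolkits onward) exposes a ptxas flag that sets the default load cache modifier for a compilation unit. A sketch of how it might be used (the target names are illustrative):

```shell
# -dlcm sets the default load cache modifier for all global loads
# in the compilation unit: "ca" caches in L1 and L2 (the default),
# "cg" caches in L2 only, so global loads bypass L1.
nvcc -arch=sm_20 -Xptxas -dlcm=cg -o app app.cu
```

It is all-or-nothing per file, so it doesn't replace per-array control, but it covers the "we do our own caching in shared memory" case.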

Indeed, that is an interesting question. My guess is that shared memory and registers still have a lower access latency. It will still be advantageous to stage data in shared memory when it is either frequently re-used or not accessed in a cache-friendly pattern.
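The "frequently re-used" case is the classic one. A minimal sketch of a 1D stencil where each input element is read by several neighbouring threads, so staging a tile in shared memory pays off regardless of what L1 does (kernel name and tile sizes are my own choices):

```cuda
#define TILE   256
#define RADIUS 3

// Assumes it is launched with blockDim.x == TILE.
__global__ void stencil1d(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2 * RADIUS];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + RADIUS;

    // Each global value is loaded once but read up to 2*RADIUS+1 times.
    if (gid < n)
        tile[lid] = in[gid];
    // First RADIUS threads also fetch the halo on both sides.
    if (threadIdx.x < RADIUS) {
        tile[lid - RADIUS] = (gid >= RADIUS)    ? in[gid - RADIUS] : 0.0f;
        tile[lid + TILE]   = (gid + TILE < n)   ? in[gid + TILE]   : 0.0f;
    }
    __syncthreads();

    if (gid < n) {
        float acc = 0.0f;
        for (int i = -RADIUS; i <= RADIUS; ++i)
            acc += tile[lid + i];
        out[gid] = acc;
    }
}
```

Here the re-use factor (2*RADIUS+1 reads per load) is known at compile time, which is exactly the situation where explicit shared-memory staging beats hoping the cache keeps the data resident.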

There is ISA support for this in PTX 2.0. All someone would need to do to make this happen is either write in assembly (and get it working now), or extend nvcc to add an intrinsic for uncached accesses.

Ah right, I forgot that there is (official/unofficial?) support for inline assembly. That might be the best approach for now to bypass the cache.
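A sketch of what that could look like with the PTX 2.0 cache operators, wrapped in a device helper (the helper name is mine; the mnemonics are from the PTX ISA, and the "l" pointer constraint assumes a 64-bit build):

```cuda
// ld.global.cg ("cache global") caches the load in L2 only,
// bypassing L1; ld.global.cv would skip caching entirely.
__device__ __forceinline__ float load_bypass_l1(const float *p)
{
    float v;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}
```

Calling `load_bypass_l1(&a[i])` instead of `a[i]` then gives per-access cache control without waiting for an official intrinsic.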

Does anyone know whether the L1 cache services only its own SM, or other SMs as well?

If it is per-SM, then to use it effectively, blocks which access the same memory would need to run on the same SM.

L1 is per-SM, L2 is across the entire chip.
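Worth noting that on Fermi the per-SM L1 and shared memory share the same 64 KB of on-chip storage, and the split can be requested per kernel through the runtime API. A sketch (`myKernel` is a placeholder name):

```cuda
__global__ void myKernel(float *data);

void configure(void)
{
    // Request 48 KB shared / 16 KB L1 for kernels that do their own
    // staging in shared memory...
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    // ...or the reverse split for kernels that lean on the hardware cache:
    // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
}
```

So the "cache vs. shared memory" trade-off isn't just about access patterns; you can also shift the capacity toward whichever side your kernel actually uses.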