how is reconfigurable cache/memory implemented?

I think Fermi’s reconfigurable shared memory/cache is great, but does it add latency to shared memory access (excluding address decoding due to Fermi’s unified address space)? To me, a reasonable implementation shouldn’t add any latency, not that latency matters that much for GPUs.

Can someone point me to any publications or patents?

Also, how much extra space does the cache need compared to shared memory? A quick calculation says very little space for the tag memory: 128 byte line size means 27 bits/tag = 40(address bits) - 7(line size) - 6(64 cache sets for 8 associativity(my guess) ). 48KiB means 384 cache lines, hence 1296 bytes for tag storage, which is only 2.6% of 48KiB. Maybe it will cost more because the tag memory is very wide to support high associativity.

Why do you think IBM/Toshiba/Sony didn’t implement this for Cell?

I don’t know about latency, but the L1 cache-hit throughput is lower than shared memory throughput even though they both use the same storage cells. Different cache lines were (even if all hit) lowered throughput even more (and again, unlike shared, which can read different lines in different banks simultaneously.) It’s better on GF104 than GF100.

There’s a thread on this forum from a couple months ago where I posted a quick program to test this… Google could probably find it.

I’m curious if GF110 changed anything from GF100… probably not.

Even though caches are not fully coherent on Fermi, you still need some minimal support to guarantee some consistency in case of false sharing. Imagine two threads running on different SMs write a word at two different locations in their respective copies of the same cache line. During writeback to L2, cache lines coming from the L1 of each SM should be merged together, rather than have one line just overwrite the other.

This requires some kind of per-word validity information, which should at least double the tag storage requirements.

Then there is the whole cache management logic.

First, they have a very tight latency budget (high clock rate, low latency), and their design goals are to keep the datapaths as straightforward as possible.

More importantly, they rely on static instruction scheduling from the compiler. It works (well, sort of) because all hardware latencies can be predicted statically.

But cache latency is non-deterministic (at least from a compiler’s viewpoint). Cell has no hardware mechanism that would make it able to tolerate the latency of a cache miss. Which means caches would be so slow that they would be nearly useless.

(at least data caches. Excluding instruction caches was just an unfortunate design decision in my opinion :/ )

By contrast, Fermi’s SMs can easily tolerate very long and non-deterministic latencies, thanks to extensive hardware multithreading.

Right, I completely forgot about coherency hardware.

SPWorley said the throughput for L1 cache hit was less than for shared memory. I suppose the 1st reason could be due to serialization of cache lookups. If a warp does a cache access, there can be up to WARP_SIZE cache lookup operations if every address maps to a different cache line set. If all addresses map to the same cache line set, then only 1 lookup would be needed.

The 1st reason probably doesn’t apply to SPWorley’s case. The 2nd reason could be due to the way they implement cache associativity. It seems high associativity could be very expensive to implement due to the need to read all lines in a set. I don’t think the design would output all cache lines in the same set at once. Therefore, I think it would have to take ASSOCIATIVITY cycles to read out all lines. But that would waste a lot of bandwidth because you probably you’ll only end up using 1 line from that set. So I’m beginning to think the associativity wouldn’t be 8 due to that waste, probably 1 or 2.

Update: After thinking about it, reason 2 shouldn’t be a problem. The entire data set doesn’t need to be read out, only the entire tag set. Therefore the cache can selectively read out the line(s) that match.

I still think a reconfigurable cache/local store would be useful because it would allow incremental optimization (write naive version, then block code for local store). Many people complain how difficult it is to program using only Cell’s local store.

Here’s the quick tool I wrote a while back. It shows that GF100 has 40% lower throughput on L1 cache hit reads than for shared memory. GF104’s L1 has full throughput… there’s no noticeable L1 hit overhead (if it is slower, its less than 2% or so.)

Anyone have a GTX580 who can run it to see if GF110 has changed? I would guess it’s the same as GF100.

SM writes directly into L2. L1 is NOT used.

Indeed, you are right. Global writes always write through the L1. Only local writes can do write back. So no need for per-word or per-byte validity information.

Thanks for the correction!

(For reference, here is what the PTX doc says)

By the way, GF10x is sm_21. Does it behave the same way? (I would think so…)

I don’t know exactly. But “compute profiler” doesn’t have L1 write counter at all. There is no difference between 2.1 and 2.0 here.

L1 and smem throughputs are the same. If experimental data shows otherwise then it’s the result of the code (more specifically assembly being generated and SM issue capabilities). A C2050 (gf100) has 1030 GB/s of bandwidth for both L1 and smem. To test this I wrote a quick program that sums up a bunch of floats, from either gmem or smem. A total of 32 different gmem addresses are accessed. Now, subsequent accesses by the same thread are separated by stride passed into the kernel, which happens to be 0 to guarantee hits after the first compulsory miss. However, one still has to execute instructions to update the address being loaded from as well as consume the values loaded. As a result, about a quarter of instructions end up being shared memory accesses. So, with dual-issue you are issuing an L1 requests in roughly half the clock ticks, thus observing roughly half of theoretical L1 bandwidth (I see 505 GB/s on my C2050). This is where gf104 might help, as it is capable of quad-issue (and for some instructions has 50% more pipelines per SM), thus allowing issuing L1 requests in a much higher percentage of clock ticks.

For shared memory, you can get reduce the number of non-smem instructions by declaring shared memory as volatile. This will force the compiler to load from smem every time, even if index isn’t changing (so we can get rid of the index math). With volatile and no index math I see 943 GB/s, with index math I get 497 GB/s.

The key is that with index math both smem and gmem kernels run at over 98% of instruction throughput rate. (2.7 KB)

Makes sense. Good explanation. I was disappointed with the GF100’s low instruction issue width. The 2 warp schedulers combined can only issue 32 operations (mix of arithmetic, load/store) a clock cycle. This won’t keep all the execution units busy because ideally, the 32 ALUs and 32 shared memory banks can work in parallel.

Well I think I might have found an answer to part of my question:

On p.23 of the IBM Cell SPU ISA guide 1.2, it says

If you look at a Cell die and compare the L2 cache area with the local store area, the ratio seems to be ~2.6.

Wow that’s a pretty high cost compared to the 2.6% I calculated. Sylvain mentioned the Fermi L1 caches aren’t coherent until after __threadfence() or end of kernel. Would this reduce the cache hardware cost?