how is reconfigurable shared memory/cache implemented?

I think Fermi’s reconfigurable shared memory/cache is great, but does it add latency to shared memory access (excluding address decoding due to Fermi’s unified address space)? To me, a reasonable implementation shouldn’t add any latency, not that latency matters that much for GPUs.

Can someone point me to any publications or patents?

Also, how much extra space does the cache need compared to shared memory? A quick calculation says very little space for the tag memory: 128 byte line size means 27 bits/tag = 40(address bits) - 7(line size) - 6(64 cache sets for 8 associativity(my guess) ). 48KiB means 384 cache lines, hence 1296 bytes for tag storage, which is only 2.6% of 48KiB. Maybe it will cost more because the tag memory is very wide to support high associativity.

Why do you think IBM/Toshiba/Sony didn’t implement this for Cell?