Fermi L1 cache: associativity and access pattern


Can someone shed some light on Fermi’s L1 cache? I know it has 128-byte lines and can be configured as a 16 KB/48 KB or 48 KB/16 KB split between L1 and shared memory.
I want to know:

  1. What is the associativity of this cache?
  2. How many L1 requests can it process at a time (per SM)?
  3. What happens when multiple requests target a single cache-line? Will there be a broadcast?
  4. Can multiple threads read different parts of the same cache-line?
    Or should different threads be reading different cache-lines?
  5. What is the theoretical L1 cache bandwidth (per SM)?

Has someone performed any micro-benchmarks?

Best Regards,

Just thinking aloud… Feel free to correct me:

If 32 threads read 4 consecutive bytes each – a natural memory-coalesced read sequence that works well on all NVIDIA GPU architectures – that amounts to 128 bytes.
So, if all the reads of a warp fall within one cache-line, they should all be served from that single cache-line.
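As a toy illustration of that coalescing claim (a Python sketch of the address arithmetic, not hardware-verified behavior), we can count how many 128-byte lines a warp touches:

```python
# Toy illustration (not hardware-verified): count how many 128-byte L1 lines
# a warp of 32 threads touches for a given set of byte addresses.
LINE_SIZE = 128  # bytes per Fermi L1 cache line

def lines_touched(addresses):
    """Return the distinct 128-byte-aligned lines covered by the addresses."""
    return {addr // LINE_SIZE for addr in addresses}

# Coalesced: thread i reads 4 bytes at base + 4*i -> one line serves the warp.
print(len(lines_touched([4 * i for i in range(32)])))    # 1

# Strided by 128 bytes: every thread lands on a different line.
print(len(lines_touched([128 * i for i in range(32)])))  # 32
```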

If this data can be delivered every clock cycle, it would be 128 bytes × 1.3×10^9 clocks/s (assuming a 1.3 GHz clock) ≈ 166 GB/s per SM.
If there are 16 SMs, this corresponds to about 2662 GB/s ≈ 2.7 TB/s.

Hmm… This is quite a number.

However, if the L1 cache also needs 2 clocks per access like shared memory, this number would halve to about 1.3 TB/s.
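The estimate as a quick check – both the 1.3 GHz clock and the one-line-per-clock delivery rate are assumptions from this thread, not measured figures:

```python
# Back-of-the-envelope bandwidth estimate; the 1.3 GHz clock and the
# one-128-byte-line-per-clock delivery rate are assumptions, not measurements.
LINE_BYTES = 128
CLOCK_HZ = 1.3e9   # assumed shader clock
NUM_SMS = 16

per_sm = LINE_BYTES * CLOCK_HZ / 1e9   # GB/s per SM
total = per_sm * NUM_SMS               # GB/s for the whole chip

print(per_sm)     # 166.4 GB/s per SM
print(total)      # 2662.4 GB/s ~= 2.7 TB/s
print(total / 2)  # 1331.2 GB/s if each access takes 2 clocks
```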

On the contrary, if threads in a warp read completely different addresses, then bandwidth depends on the total number of outstanding requests the L1 cache can respond to. If this is a fraction of 32, then the total bandwidth will also get multiplied by that fraction…
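That scaling can be written down as a toy model (an assumption about the hardware, not a measured result): if L1 returns one 128-byte line per clock, a warp touching N distinct lines takes N clocks, and useful bandwidth shrinks accordingly.

```python
# Toy model (an assumption, not measured): a warp touching N distinct lines
# takes N line-deliveries, so useful bandwidth is the fraction of delivered
# bytes that the warp actually requested.
def efficiency(num_lines, bytes_requested=32 * 4):
    """Fraction of peak bandwidth carrying requested data."""
    return bytes_requested / (num_lines * 128)

print(efficiency(1))   # 1.0     -> fully coalesced warp
print(efficiency(32))  # 0.03125 -> 32 scattered 4-byte reads
```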

All these are just my thoughts. Please help me understand this correctly.

I would assume the L1 itself has characteristics similar to shared memory, given that they are physically interchangeable (16 KB/48 KB or 48 KB/16 KB shared/L1 can be selected at runtime). That would mean the performance of L1, assuming a cache hit, is identical to shared memory – bank conflicts, 2-clock timing, and all.

For cache misses, who knows…


You have a valid point.

The cache line is 128 bytes in size, and shared memory changes bank every 4 bytes.
So had they implemented L1 using a shared-memory-like architecture,
there must be extra banks out there beyond the regular 32.

There must be a total of 128 banks to account for the 64 KB of data (16 + 48).
Each bank would hold 512 bytes of data.
A 128-byte L1 cache line would then come from 32 different banks.
Say we configure L1 for 48 KB:
it would take 3×128 bytes for addresses to wrap back and share the same bank.
So the entire 48 KB would look like a set-associative cache with 3 cache-line slots and an associativity of 128 (512/4).
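That bank-wrap guess can be sanity-checked with a toy address-to-bank mapping. This is pure speculation mirroring the thread, not documented hardware; the bank counts are parameters of the guess:

```python
# Pure speculation, mirroring the post's guess: model L1 as shared-memory-style
# interleaved banks, 4 bytes wide each.
def bank_of(addr, num_banks):
    """Bank index for a byte address, with 4-byte-wide interleaved banks."""
    return (addr // 4) % num_banks

# With all 128 banks (the full 64 KB), a given bank repeats every 512 bytes:
print(bank_of(0, 128), bank_of(512, 128))  # 0 0

# If the 48 KB partition striped across only 96 of them, addresses would wrap
# after 3 * 128 = 384 bytes -- the 3-line wrap guessed above:
print(bank_of(0, 96), bank_of(384, 96))    # 0 0
```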

Maybe…

Best Regards,