Cache Characterization: Strange L2 Behavior

I’ve been looking into characterizing Fermi’s caches with a little microbenchmarking, inspired by the “Micro-benchmarking the GT200 GPU” paper and my own curiosity. Borrowing and adapting some of the code from that paper, I made a benchmark that could help find out cache characteristics - set associativity, hit/miss latencies. But the data I got was a little unexpected. The code as well as the charts generated are attached.

One thread steps through arrays of varying size with a fixed stride. I could confirm the 128 B line size, the two L1 cache size settings, and the L2 size. I attached the graphs I made of the cache latencies. Strange behavior occurs in the transition from L1 to L2. I was expecting a roughly linear increase in average access time as the data array grows past the L1 cache size. Instead, the latency plateaus for a while before reaching what looks to be the L2 cache latency. This happens in both L1 cache configurations (16 KB and 48 KB), but at different sizes and over different intervals.
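For readers without the attachment, the access pattern can be modelled on the host roughly as follows. This is a minimal Python sketch, not the code in l1cache.cu; the names build_chase and walk are made up for illustration, and sizes are in array elements rather than bytes.

```python
def build_chase(num_elems, stride):
    """Return a list where each entry holds the index of the next element to visit."""
    return [(i + stride) % num_elems for i in range(num_elems)]


def walk(chain, steps, start=0):
    """Follow the chain for `steps` dependent loads and return the visited indices."""
    visited = []
    j = start
    for _ in range(steps):
        visited.append(j)
        j = chain[j]  # each load depends on the previous one, so latencies cannot overlap
    return visited


# 8 elements with a stride of 2: the walk wraps around after 4 steps
chain = build_chase(8, 2)
print(walk(chain, 6))  # [0, 2, 4, 6, 0, 2]
```

Because every load address comes from the previous load, the time per step is a direct measure of the access latency at that working-set size.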

Some ideas that have been thrown around, none of them really convincing:

    possible prefetch - but why it takes that long to learn the access pattern is strange

    effect of replacement policy - I haven’t thought this one through yet

    a NUCA topology - locality might explain a plateau with the 48 KB L1 cache, but maybe not the fact that it also occurs at a smaller data size when using the 16 KB L1.

So mainly, the fact that it occurs in both L1 cache configurations and with different intervals is what throws off these guesses.
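To make the replacement-policy guess easier to reason about, here is a small Python model of a set-associative cache with strict LRU replacement, walked cyclically at one access per line. The 4-way associativity and the simple modulo set indexing are assumptions for illustration, not confirmed Fermi parameters, and lru_hit_rate is a made-up name.

```python
from collections import OrderedDict


def lru_hit_rate(array_bytes, cache_bytes=16 * 1024, line_bytes=128,
                 ways=4, stride_bytes=128, passes=4):
    """Steady-state hit rate of a cyclic strided walk over a strict-LRU cache.

    ways=4 and modulo set indexing are assumptions, not measured values.
    """
    num_sets = cache_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    total = 0
    for p in range(passes + 1):  # pass 0 only warms the cache up
        for addr in range(0, array_bytes, stride_bytes):
            line = addr // line_bytes
            lru = sets[line % num_sets]
            hit = line in lru
            if hit:
                lru.move_to_end(line)        # mark as most recently used
            else:
                if len(lru) >= ways:
                    lru.popitem(last=False)  # evict the least recently used line
                lru[line] = True
            if p > 0:
                hits += hit
                total += 1
    return hits / total


# Sweep the array size from 16 KB to 20 KB in steps of 8 lines
for kb in (16, 17, 18, 19, 20):
    print(kb, lru_hit_rate(kb * 1024))
```

Under these assumptions a 16 KB array hits every time and a 20 KB array (one extra line per set) misses every time, with the hit rate falling steadily in between. A long plateau at an intermediate latency is therefore not what this strict-LRU model predicts, which is at least consistent with the hardware doing something else.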

Would anyone have an idea or better educated guess as to what is going on? It seems like, if some hardware is actually causing this plateau effect, it would be neat to figure out how to possibly utilize/exploit it.

Oh, this was tested on a GTX480. It may also be interesting to see what happens on other models.

Hopefully I’m not making any stupid assumptions or mistakes in the code… If you are interested in the script used to generate the graphs, it’s here: graph.py (linked because I can’t attach it).
48kB_CacheGraph.png
l1cache.cu (4.83 KB)
16kB_CacheGraph.png

Interesting experiments! For the sake of completeness, I ran your experiments on a GTX470 (results are attached).

Since I’m also interested in the GPU’s cache, I will give this some thought.
gtx470-16.png
gtx470-48.png

I would tend to think this is the most likely.

Assuming every hit has the same latency and every miss has the same latency, a horizontal line like the one you have would indicate a constant ratio of hits to misses.

From the graph, the hit rate seems to be 1/3 for the 16K configuration and 1/5 for the 48K configuration. Can you compute it more accurately from the raw numbers?
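The relation behind this estimate: if every hit costs t_hit and every miss costs t_miss, the measured average is t_avg = h * t_hit + (1 - h) * t_miss, which can be solved for the hit rate h. A small Python helper for applying this to the raw numbers; the function name is made up and the latencies in the example are placeholders, not values read from the graphs.

```python
def hit_rate(t_avg, t_hit, t_miss):
    """Solve t_avg = h * t_hit + (1 - h) * t_miss for the hit rate h."""
    if t_miss <= t_hit:
        raise ValueError("miss latency must be larger than hit latency")
    return (t_miss - t_avg) / (t_miss - t_hit)


# Placeholder latencies in cycles, chosen only to illustrate the arithmetic
print(hit_rate(t_avg=210.0, t_hit=90.0, t_miss=250.0))  # 0.25
```

The plateau value goes in as t_avg, with t_hit and t_miss taken from the flat regions on either side of the transition.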

If you are brave enough, you can try to measure the latency of individual (or at least a few) memory requests, instead of averaging over many iterations. The clock() function on the GPU is cycle-accurate. It will take some overhead to call it, but this overhead should be constant and just shift the latency upwards.

You could even check within the kernel which accesses are hits and which are misses (by comparing the time delta with some predefined threshold), and pack the results as a string of bits in shared memory. Then transfer the results back to main memory and analyze it offline.
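The bookkeeping is the same whether it runs in the kernel or offline. Here is a Python sketch of the thresholding and bit packing; the 32-bit word size, the bit order, the threshold, and the function names are all assumptions for illustration. On the GPU the same logic would write into a shared-memory array.

```python
def pack_hits(latencies, threshold):
    """Pack one bit per access into 32-bit words: 1 = hit (latency below threshold)."""
    words = [0] * ((len(latencies) + 31) // 32)
    for i, cycles in enumerate(latencies):
        if cycles < threshold:
            words[i // 32] |= 1 << (i % 32)
    return words


def unpack_hits(words, count):
    """Recover the per-access hit/miss pattern from the packed words."""
    return [bool((words[i // 32] >> (i % 32)) & 1) for i in range(count)]


latencies = [92, 310, 305, 95, 300]  # placeholder cycle counts, not measurements
words = pack_hits(latencies, threshold=200)
pattern = unpack_hits(words, len(latencies))
print(pattern)                      # [True, False, False, True, False]
print(sum(pattern) / len(pattern))  # 0.4
```

Looking at the unpacked pattern directly shows whether the hits are periodic (for example every third access), which an average latency cannot reveal.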

Easier said than done, I know. ;)

The weird results in the previous architectures (Tesla) have usually come from interaction with the address translation caches. You might want to try distributing your working set among multiple memory pages (try something big like 64KB or 128KB to make sure that you are in a different page) and see if this changes your results. For example, put each cache line in a different memory page at the corresponding offset so that the L1 tags will be the same, but a separate address translation will be required.
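One way to read the layout suggested here, as a Python sketch of the byte offsets: line i moves to page i but keeps the offset within the page that it had in the dense array. The 64 KB page size is just the guess from the post, and the idea that the low address bits alone pick the cache set is an assumption; spread_offsets is a made-up name.

```python
def spread_offsets(num_lines, line_bytes=128, page_bytes=64 * 1024):
    """Byte offset of line i: page i, at the same in-page offset as in the dense array."""
    return [i * page_bytes + (i * line_bytes) % page_bytes for i in range(num_lines)]


offsets = spread_offsets(4)
print(offsets)  # [0, 65664, 131328, 196992]
```

The pointer chain would then link these offsets in order instead of consecutive lines. Note that the buffer has to cover num_lines * page_bytes bytes, so 128 lines at 64 KB per page already needs 8 MB.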

Hi, could you please post the code for detecting other parameters of Fermi’s caches too? I’m interested in finding those on my own Fermi cards.
Thanks