Question on the L1 caching of the GK 110

Hi

I’m currently writing a term paper on the GK110 architecture, and I’m wondering how L1 cache requests behave:
First possibility: The L1 cache request is broken down into cache lines, and every needed cache line is transferred in full to the load/store units. Thus, if the threads of a warp don’t access the L1 cache in a coalesced manner, much of the L1 bandwidth is wasted.
Second possibility: This waste could be avoided if the L1 cache didn’t always transfer a complete cache line, but only those bytes of a cache line that are actually needed. Since the L1 cache and the shared memory use the same hardware, at least part of the L1 cache hardware should be able to do so.

Which of the two possibilities is true?
I’d suppose that the first possibility is correct, but I’m not sure.
Thanks in advance for your help! :)

Regards Fiepchen

My understanding is that the L1 cache works as one wide bank, not as 32 narrow banks as in shared memory. I.e., possibility (1).

For Kepler GK104 and GK110 (compute 3.0 and 3.5), L1 cache is used for register spills, and only for register spills. So the L1 cache line is always read fully coalesced by the warp since every thread accesses its own word in the cache line, and never any other in the same cache line.

For Fermi, the L1 is more traditional: it also caches global memory, which is cached in L2 as well. Threads can indeed access L1 efficiently in an uncoalesced manner. As long as all threads read from the same cache line, any permutation of the order within that line is free. If threads read from different cache lines, the accesses are serialized, though the serialization scales sublinearly (i.e., reading from 32 different cache lines is not 32 times as slow, but instead 8 times).
See for example some code and results of old experiments I did here: https://devtalk.nvidia.com/default/topic/476667/cuda-programming-and-performance/why-texture-memory-is-better-on-fermi-/post/3401148/#3401148
An interesting behavior is that Fermi SM 2.0 L1 access has significantly lower performance than Fermi’s shared memory access, even though it’s using the same hardware. Fermi SM 2.1’s L1 performance is comparable to shared.
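To make the cache-line rule above concrete, here is a minimal host-side C++ sketch (not GPU code) that counts how many distinct 128-byte cache lines one warp-wide load touches. The function names and the 128-byte line size are assumptions for illustration. The sketch only counts lines; it does not model the sublinear scaling of the serialization mentioned above.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Assumed L1 cache line size: 128 bytes (32 four-byte words).
constexpr std::uint64_t kLineBytes = 128;

// Byte addresses read by a 32-thread warp when lane t reads
// base + t * strideBytes (hypothetical helper for illustration).
std::vector<std::uint64_t> warpAddresses(std::uint64_t base,
                                         std::uint64_t strideBytes) {
    std::vector<std::uint64_t> addrs;
    for (std::uint64_t lane = 0; lane < 32; ++lane)
        addrs.push_back(base + lane * strideBytes);
    return addrs;
}

// Same words as a unit-stride access, but in reversed lane order:
// lane t reads word (31 - t) of the line starting at base.
std::vector<std::uint64_t> reversedWarpAddresses(std::uint64_t base) {
    std::vector<std::uint64_t> addrs;
    for (std::uint64_t lane = 0; lane < 32; ++lane)
        addrs.push_back(base + (31 - lane) * 4);
    return addrs;
}

// Number of distinct cache lines touched by one warp-wide load.
std::size_t distinctCacheLines(const std::vector<std::uint64_t>& addrs) {
    std::set<std::uint64_t> lines;
    for (std::uint64_t a : addrs) lines.insert(a / kLineBytes);
    return lines.size();
}
```

In this model, a unit-stride access and its reversed permutation both touch a single line, while a 128-byte stride puts every lane in its own line, which is the case that gets serialized.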

Wow, this is new! But it seems to be true… I wonder how that is possible. Was it in the white papers? It makes no sense to me.

Btw the new read-only L1 cache is still effective. A curious note about that one - when I run 1 thread, it sees only a quarter of it - 12KB.

Thank you very much for your answers!

Are you sure about this? Since NVIDIA writes concerning Kepler in its Kepler tuning guide:

So it seems that the GK110 caches local memory in general in L1. Thus, if a warp accesses a local array or struct randomly, the coalescing will be gone.

This is also currently bothering me, since NVIDIA’s statements are kind of contradictory here. Are the global loads indeed not cached in L1 on the GK104 as well?
On the one hand there’s e.g. a visual profiler metric about the global L1 caching in compute capability 3.x but on the other hand there’s the above quote about Kepler’s L1-Caching.

On compute capability 3.0 and 3.5 devices the L1 cache is only used to cache local memory (register spills and auto variables). Global memory is not cached in L1. On compute capability 3.5 devices global memory can be cached in the read-only global cache (texture cache).

Thank you for your help. But this raises one more question. There is the following figure in the programming guide, which shows “Examples of Global Memory Accesses”:

So according to this figure, a device of compute capability 3.0 transfers only 128-byte blocks between DRAM and cache. This surprises me, because the L2 cache line size “should” also be 32 bytes on a device of compute capability 3.0, and thus 32-byte blocks “should” be transferred on a cache miss. That is why I’d expect a device of compute capability 3.0 to behave just like a device of compute capability 2.x with L1 caching disabled.

So is this figure correct? And if yes, why?

No, because local memory is private per thread, so one thread of a warp can never read another thread’s data. This means their accesses to those cached values are completely coalesced and use the exact same cache line. Multi-word structures and arrays are not stored sequentially in memory but fully strided, so that a thread always reads from its own bank and there are no misalignments that could cause a bank conflict.

An interesting and subtle side question is how L1 local memory changes on Kepler if the shared memory is set to 64-bit banking mode versus 32 bit banking mode. Is the local memory striding changed? Or is L1 independent of the shared 32/64 width mode? A 32-bit word local memory array would be weird to access if the L1 was in 64 bit banking mode… it’d probably use a stride of 64 words. But then it’d be wasting half the storage in the cache lines unless the compiler was really clever about stuffing other data in the unused single-word lane remaining.

No, global reads are not cached in L1 in either GK104 or GK110. Only register spills, local memory, and stack, which are all private per-thread. (I was wrong when I said before that it was only register spills, but the other accesses have the same private per-thread design.)

Yes, it’s correct. L2 only stores full cache lines of 32 words (128 bytes).
But what you’re thinking of is 32 byte read transactions. Those are not reads from global to L2. Those are reads from global or L2 to your threads on the SMX.
This gets tricky to understand, and the best guide is not in the online docs, but in the always-excellent marathon “Ninja Tuning” GTC presentations by Paulius Micikevicius. His 2012 slides are here… the 2013 slides don’t seem to be online. http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0514-GTC2012-GPU-Performance-Analysis.pdf
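The difference between 32-byte transactions and 128-byte lines can be put into numbers with a small host-side C++ counting model. This is a sketch under simplifying assumptions, not a description of the actual hardware: the function name is made up for illustration, each lane reads one aligned 4-byte word, and traffic is simply the number of distinct blocks touched times the block size.

```cpp
#include <cstdint>
#include <set>

// Bytes moved for one warp-wide load when data travels in blocks of
// `granularity` bytes (e.g. 32-byte segments or 128-byte lines).
// Lane t reads the 4-byte word at base + t * strideBytes; base and
// strideBytes are assumed to be multiples of 4, so no word straddles
// a block boundary.
std::uint64_t bytesTransferred(std::uint64_t base,
                               std::uint64_t strideBytes,
                               std::uint64_t granularity) {
    std::set<std::uint64_t> blocks;
    for (std::uint64_t lane = 0; lane < 32; ++lane)
        blocks.insert((base + lane * strideBytes) / granularity);
    return blocks.size() * granularity;
}
```

A warp always needs 32 × 4 = 128 useful bytes. With unit stride, both granularities move exactly 128 bytes. With a 128-byte stride, the model moves 4096 bytes at 128-byte granularity but only 1024 bytes at 32-byte granularity, which is why smaller transactions help scattered accesses.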

Thanks again for your answers. But I’m afraid I still don’t understand it completely.

This part was already clear to me from the beginning.

But those parts:

How can this be? Let’s look at the following example:

float LocalArray[32];
float F1 = LocalArray[0]; // Fully coalesced: LocalArray is strided, so all threads access a word in the same cache line
float F2 = LocalArray[threadIdx.x]; // No coalescing possible: only one word of each requested cache line is needed

Where am I wrong here?

This was kind of my first question. So is the local-memory transfer between the LSUs and the L1 cache further broken down from cache line requests into bank requests, in order to avoid wasting L1 bandwidth in the case of uncoalesced accesses? (I’d especially like to have some “official” source for this part, since I couldn’t find this in NVIDIA’s documentation at all.)

So disabling the global L1 caching on CC 2.x just reduces the transaction size between L2 and SMX from 128 byte to 32 byte and doesn’t affect the transfers between L2 and Device-Memory? Thus disabling L1 caching primarily saves L2 bandwidth and not DRAM bandwidth?

I presume the best you can do to find out is write a small microbenchmark that does lots of random reads in local arrays and see what speed you get. Then you match that with either model of the cache’s workings to see how it probably works.

NVIDIA usually does not comment on this kind of design decision, at least not until someone works it out via the above procedure and publishes the findings.

Fiepchen, you may be wrong in your assumption of how local memory is laid out in the global memory space.

When you have 1 warp with each thread having an array LocalArray[32] in local memory, you are talking about 1024 numbers in total. Essentially, you are dealing with a 2D array. One dimension of the array is the index in LocalArray, another dimension - the thread index. It is not very difficult to figure out how to store this 2D array in memory so that all accesses are coalesced.
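One way to picture this 2D layout is the following host-side C++ sketch. The exact address arithmetic is an assumption for illustration (all names here are hypothetical). It only follows the general rule that consecutive 32-bit words of local memory belong to consecutive threads, so element i of lane t sits at word offset i * 32 + t within the warp’s block.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

constexpr int kWarpSize = 32;
constexpr std::uint64_t kWordBytes = 4;         // 32-bit elements
constexpr std::uint64_t kLocalLineBytes = 128;  // assumed cache line size

// Assumed interleaved layout: element `index` of the per-thread array
// for lane `lane`, as a byte offset within the warp's local memory block.
std::uint64_t localByteOffset(int lane, int index) {
    return (static_cast<std::uint64_t>(index) * kWarpSize + lane) * kWordBytes;
}

// Number of distinct cache lines touched when each lane reads
// element indexOf(lane) of its own array.
template <typename IndexFn>
std::size_t linesTouched(IndexFn indexOf) {
    std::set<std::uint64_t> lines;
    for (int lane = 0; lane < kWarpSize; ++lane)
        lines.insert(localByteOffset(lane, indexOf(lane)) / kLocalLineBytes);
    return lines.size();
}
```

Under this assumed layout, all lanes reading the same index (as in LocalArray[0]) touch one cache line, while each lane reading a different index (as in LocalArray[threadIdx.x]) touches 32 lines, with only one useful word in each.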

Ok, thanks.

How? I’d even say it is impossible, except for the array of the whole warp fitting into a single cache line.
Even the NVIDIA guide writes about coalescing of local memory:

So I’d deduce that coalescing won’t work (well) if the threads within a warp access different indices.

Oh yeah, to be coalesced, all threads have to access the “same” local memory variable, as with register spills and stack data. I guess this is what Steve (and I) were referring to. If you use local memory explicitly and index it using the thread ID, it won’t be coalesced. I thought this usage pattern was not very common. Do you happen to have any examples where it is used? Just wondering…

No, sorry this was just an example, where uncoalesced accesses were supposed to happen according to my understanding, in order to see whether I was wrong again.

But on the other hand, placing an array in local memory and accessing it randomly (e.g. by using this array as a buffer or stack) is quite common, isn’t it? And that is where those uncoalesced accesses are going to happen.
This was also kind of the reason for my first question. I was curious whether it’s better to place those indexed arrays in shared memory instead of local memory in order to achieve a higher performance in cases where not much coalescing is possible.

Currently it is better to put those indexed arrays in shared memory if they are performance sensitive.

Fiepchen, you’re right that dynamic indexing into a local array will not be coalesced. As Vasily said, this is so rare that I didn’t consider it either! What is most common (for local arrays) is for the array to be indexed with compile-time constant indices (often in an unrolled loop). Those accesses are all coalesced.

So specifically answering your question, local arrays aren’t too commonly used. And random dynamic access of a local array is extremely rare! It’s supported, but I’ve never seen it done. It’s quite common in the CPU world where a CPU core is swimming in an ocean of cache, so even a multi-kilobyte dynamic array is still cheap to access. In the GPU world, you have thousands of threads in one SMX. All of those threads share that same dinky little L1, so even a tiny array per thread would explode the L1 cache use into an inefficient extravagance.

However, for the (common) cases where you need a small lookup table, there is the constant cache! It is quite limited in what it can do, but still has many uses. It is only really useful when all threads in a warp read the same index, though. This is common for things like small lists of input data (which is why Fermi and later use the constant cache to pass in kernel arguments).

Ok, thank you.