Why does the L1 cache hit ratio become zero on K20?

Hi all,

Does anyone know why the L1 cache hit ratio becomes zero for most CUDA workloads on the K20 GPU, such as MonteCarlo? I profiled MonteCarlo on an M2090, a GTX680 and a K20. The workload had a large hit ratio on the M2090 and GTX680, but a zero hit ratio on the K20. It seems all the memory accesses are uncached. Does anyone know the reason? Is it because of cache coherence?

Thanks

As far as I remember, Kepler devices of compute capability 3.5 and higher do not cache global loads in L1; there is a separate cache for read-only data, and the compiler decides whether particular data can be handled via that cache or not. Pointers must be declared with the const and __restrict__ keywords in order for the data they point to to be handled via that cache.

Hi Romant,

Thanks for the reply. Is there any official NVIDIA document describing this change? Then I wonder what the L1 cache is used for on the K20. Another possible negative impact is the access latency: the texture cache or read-only cache has a much longer latency than the L1 cache.

Thanks

jinhou,

I believe that Romant is correct; the K20 is not caching global memory accesses in L1. The L1 cache is, however, still handling register spills (local memory).

In the CUDA C Programming Guide distributed with CUDA 5.0, section F.5.2 says “Global memory accesses for devices of compute capability 3.x are cached in L2 and for devices of compute capability 3.5, may also be cached in the read-only data cache described in the previous section; they are not cached in L1.”

In other sections, it says that CC 3.x devices store global memory accesses in L1, but I have found that 3.x often refers only to 3.0 and not 3.5. On the other hand, sometimes 3.x seems to refer to 3.5 and not 3.0. For example, section B.1.2 says that CC 3.x devices support dynamic parallelism, but that is only true of 3.5 devices AFAIK.

HTH,
Tom

Hi tbenson,

Your info is very helpful. Now we can be sure that global memory accesses are not cached in L1 on the K20. But I have a further question: why does NVIDIA bypass the L1 cache for global memory accesses on the K20?

AFAIK it does not use L1 when you use, for example:

const float* __restrict__ ptr

The data is then read via the “read-only data cache”, which is probably the same as using the texture cache.
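
A minimal sketch of how this looks in a kernel (the kernel and parameter names are just for illustration):

__global__ void scale(float* __restrict__ out,
                      const float* __restrict__ in,
                      float factor, int n)
{
    // const + __restrict__ on 'in' tells the compiler the data is read-only
    // and not aliased, so it may route this load through the read-only data
    // cache on sm_35 (this is not guaranteed).
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = factor * in[i];
}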

sm_35 introduced a new way to load data through the texture / read-only cache that is much more convenient than use of classical, bound, textures in conjunction with tex1Dfetch(). There is a new instruction for this called LDG that does not require binding of textures and has no size restrictions like tex1Dfetch(). It is accessible via the intrinsic __ldg(). The documentation of that new intrinsic was inadvertently left out of the CUDA 5.0 documentation, we are in the process of fixing this for the next CUDA release.
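
For illustration, a sketch of explicit __ldg() use (the kernel and variable names are mine, not from any official sample):

__global__ void copy_ro(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[i]);  // load goes through the read-only
                                 // (texture) cache on sm_35
}

This requires compiling for sm_35 or later (e.g. nvcc -arch=sm_35); the intrinsic is not available on earlier architectures.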

Use of the intrinsic __ldg() is the only way to ensure that the LDG instruction is used for a particular access. The use of the existing C/C++ qualifiers ‘const’ and ‘restrict’ facilitates the automatic generation of LDG, but it does not (and cannot) guarantee it. The Kepler Tuning Guide already says as much; we are considering how we could make the relevant sections clearer, but as far as I can see the current description is factually correct.

‘const’ and ‘restrict’ are simply ways of conveying additional information about data objects to the compiler. In particular, use of both qualifiers with a pointer says “the object pointed to by this qualified pointer is read-only when accessed through this pointer, and there are no other pointers through which it is accessed (and in particular written to) within the scope of this pointer”. That last part is very important.

Since the texture cache is not coherent, safe use of LDG requires that an object is read-only for the entire lifetime of a kernel. Clearly this is not the same as what is asserted by the programmer through the use of ‘const restrict’, except for very simple kernels, in particular those that do not call any device functions. The compiler does analysis to determine whether it is safe to use LDG, and ‘const restrict’ helps it in proving that, but there are various reasons it may not be able to prove that the use is safe (think about separately compiled device functions, for example).
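
A contrived sketch of the kind of situation meant here (the function names are made up):

// Defined in a separately compiled translation unit, so its body is not
// visible when this kernel is compiled.
extern __device__ void helper(int i);

__global__ void process(float* out, const float* __restrict__ in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        helper(i);       // the compiler cannot see this function's body, so
                         // it may be unable to prove that the data 'in'
                         // points to stays read-only for the whole kernel,
                         // and may therefore not generate LDG below
        out[i] = in[i];  // __ldg(&in[i]) would force LDG here
    }
}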

Appropriate use of ‘const’ and ‘restrict’ is a best practice that facilitates various optimizations. One of those optimizations is the generation of the LDG instruction. Please note that by using restrict, a programmer makes an assertion / promise to the compiler about the absence of aliasing, so the programmer is responsible for making sure that assertion matches reality.

[Later:]

The __ldg() intrinsic is an overloaded function __ldg(const T*), where T is one of the following built-in types:

char, short, int, long long
unsigned char, unsigned short, unsigned int, unsigned long long
int2, int4, uint2, uint4
float, float2, float4
double, double2

__ldg() is exported by the header file sm_35_intrinsics.h (which like other such files is auto-included by nvcc when compiling CUDA source).
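
So reading an element of, say, a float4 array would look like this (illustrative snippet; the names are made up):

__global__ void sum4(float* out, const float4* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = __ldg(&in[i]);  // uses the float4 overload of __ldg()
        out[i] = v.x + v.y + v.z + v.w;
    }
}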

jinhou,

I cannot say with certainty why global memory accesses are no longer cached in L1, but I expect that it is due to a combination of the low number of cache lines per core and the move to hardware virtualization.

Assuming 48KiB of L1 and a 128 byte cache line (it may be 64 bytes, I’m not sure), there are only 384 cache lines per SMX. At 192 cores per SMX, that is only 2 cache lines per core (and L1 is likely not fully associative, so 2 cache lines per core is an idealization). With numbers like that, data will not live for long at all in L1 before being evicted.
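
Spelled out as a quick calculation (host-side snippet, just restating the assumptions above):

// Back-of-the-envelope calculation; all numbers are the assumptions above.
#include <stdio.h>

int main(void)
{
    const int l1_bytes      = 48 * 1024;  // assumed 48 KiB of L1 per SMX
    const int line_bytes    = 128;        // assumed 128-byte cache lines
    const int cores_per_smx = 192;

    printf("%d lines per SMX, %d per core\n",
           l1_bytes / line_bytes,                     // 384
           (l1_bytes / line_bytes) / cores_per_smx);  // 2
    return 0;
}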

Kepler is also pushing concurrent kernels and virtualization. I am not sure if blocks from two different kernels (possibly from two different processes) can currently be assigned to an SMX simultaneously, but things seem to be headed that way. If you have disparate kernels running on an SMX, then memory locality will decrease and the L1 cache will become less effective.

Thanks Norbert! I’m looking forward to using this new feature as frequently as possible.

Hi njuffa, thanks a lot. So it seems NVIDIA uses the read-only cache in place of the L1 cache on the K20. What about the access latency and bandwidth? As far as I know, the L1 cache on Fermi has a latency of only 30 cycles, while the texture cache has a latency of more than 200 cycles. This should have a negative impact on workload performance: the read-only cache latency cannot be hidden if there are not enough ALU instructions or not enough TLP.
BTW, I can only detect 12KB of texture / read-only cache on the K20, but the K20 white paper claims a 48KB size per SMX.

Thanks

Hi tbenson,
Your explanation makes sense. Is bypassing the L1 cache also related to cache coherence?

Thanks