Section F.4.2 of the CUDA C Programming Guide, version 4.2, says:
“A cache line is 128 bytes and maps to a 128-byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 are serviced with 128-byte memory transactions whereas memory accesses that are cached in L2 only are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example, in the case of scattered memory accesses.”
Is the above statement correct?
For lines that are cached only in L2, shouldn’t all 128 bytes still be fetched to fill the L2 line? 32-byte transactions might help by returning the critical word to the SM early, but they would not reduce over-fetch.
The statement is correct. At the PTX level, load instructions accept cache-operator modifiers that indicate whether the load should bypass the L1 cache (for example, ld.global.cg caches in L2 only). If such an instruction modifier is used, then the smaller cache line will be used and less data will be fetched from the device memory. This is not directly available at the CUDA C level, although there are options to nvcc that allow you to globally disable the L1 cache (or the L1 and L2 cache) when compiling CUDA C code; for example, -Xptxas -dlcm=cg makes global loads cache in L2 only.
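To make the over-fetch argument concrete, here is a rough host-side sketch in C++ that counts how many bytes one warp-wide load would move under the transaction sizes quoted from the guide. It is only an arithmetic model, not a description of the hardware: it assumes every distinct aligned segment touched by the warp costs exactly one transaction of that size, and the helper names bytes_transferred and warp_addresses are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical helper: bytes moved for one warp-wide load, ASSUMING each
// distinct aligned segment touched costs one transaction of segment_bytes.
std::uint64_t bytes_transferred(const std::vector<std::uint64_t>& addrs,
                                std::uint64_t access_bytes,
                                std::uint64_t segment_bytes) {
    assert(access_bytes > 0 && segment_bytes > 0);
    std::set<std::uint64_t> segments;  // indices of the segments touched
    for (std::uint64_t a : addrs) {
        // An access may straddle a segment boundary, so cover every segment
        // from the one holding its first byte to the one holding its last.
        std::uint64_t first = a / segment_bytes;
        std::uint64_t last = (a + access_bytes - 1) / segment_bytes;
        for (std::uint64_t s = first; s <= last; ++s) {
            segments.insert(s);
        }
    }
    return segments.size() * segment_bytes;
}

// Hypothetical helper: byte addresses read by a 32-thread warp in which
// lane i reads from address i * stride_bytes.
std::vector<std::uint64_t> warp_addresses(std::uint64_t stride_bytes) {
    std::vector<std::uint64_t> addrs;
    for (std::uint64_t lane = 0; lane < 32; ++lane) {
        addrs.push_back(lane * stride_bytes);
    }
    return addrs;
}
```

Under this model, a warp in which each thread loads 4 bytes at a 128-byte stride (a scattered pattern) needs only 128 useful bytes, yet bytes_transferred(warp_addresses(128), 4, 128) gives 4096 bytes with 128-byte transactions versus 1024 bytes with 32-byte transactions. For a fully coalesced pattern (4-byte stride), both transaction sizes move the same 128 bytes, so the L2-only path helps only when accesses are scattered.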
"... If such a instruction modifier is used, then the smaller cache line will be used and less data will be fetched from the device memory ..."
Does this mean that the L2 cache line size is 32 bytes while the L1 cache line size is 128 bytes? That would seem to cause a lot of problems.