I am currently studying CUDA.
According to the 2022 CUDA C Programming Guide, “A cache line is 128 bytes and maps to a 128 byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 are serviced with 128-byte memory transactions, whereas memory accesses that are cached in L2 only are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example, in the case of scattered memory accesses.”
Based on this, I’ve encountered some questions regarding the granularity of L1 and L2 caches related to global memory access.
If both L1 and L2 cache lines are 128 bytes wide, then for an access cached in L2 only, is the amount of data transferred from global memory into L2 (the L2 fetch granularity) the full 128-byte cache line (four 32-byte sectors), or just 32 bytes (one sector)?
Additionally, when an access is cached in both L1 and L2, is the transfer from global memory to L2 (L2 granularity) 128 bytes, and the transfer from L2 to L1 (L1 granularity) also 128 bytes? And for each warp, is the transfer from L1 into the registers also 128 bytes?
Lastly, is it possible to adjust the L2 fetch granularity using the cudaDeviceSetLimit function? If it isn't explicitly set, do transfers always happen at the default granularity?
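For reference, this is the kind of call I have in mind; a minimal sketch, assuming cudaLimitMaxL2FetchGranularity is the relevant limit (the documentation describes it as a performance hint, so I'm not sure it guarantees the transfer size):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Hint the driver to fetch at most 32 bytes per L2 miss.
    // Note: documented as a hint only, so the hardware may not
    // honor it exactly.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMaxL2FetchGranularity, 32);
    if (err != cudaSuccess) {
        std::printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Read the limit back to see what was actually applied.
    size_t granularity = 0;
    cudaDeviceGetLimit(&granularity, cudaLimitMaxL2FetchGranularity);
    std::printf("L2 fetch granularity hint: %zu bytes\n", granularity);
    return 0;
}
```

Is this the intended use, and does the hint actually change how much data moves from global memory into L2?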
Best regards,
Rawin