variable cache line width ?

wlangdon · January 13, 2015, 7:36am

Is it possible to change the cache line size?
I have data grouped in chunks of either 8 or 16 ints, neither of which fits well with
a 32-int (128-byte) cache line.
Is it possible to change or decrease cache widths?

Many thanks
Bill

Robert_Crovella · January 13, 2015, 2:53pm

The L1 cache line width is fixed at 128 bytes.
The L2 cache line width is fixed at 32 bytes.

So assuming you access your data in the chunk sizes you have outlined, you should still be making efficient use of L2.

Other caches like texture, read-only, etc. do not flow through L1 but they do flow through L2, so most operations have a minimum fetch-from-DRAM granularity of 32 bytes. Depending on your access patterns and GPU, you might want to try using the texture or read-only cache.
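As a minimal sketch of the read-only route, assuming a compute capability 3.5 or later GPU and data laid out as consecutive 8-int chunks: the kernel and helper names below (`sum_chunks_ldg`, `run_sum_chunks_ldg`) are hypothetical, and `__ldg()` is the CUDA intrinsic that requests a load through the read-only data cache. Whether it helps depends on the access pattern, so treat it as something to profile rather than a guaranteed win.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread sums one 8-int (32-byte) chunk.
// const + __restrict__ allows the compiler to pick the read-only path itself;
// __ldg() asks for it explicitly (requires compute capability 3.5 or later).
__global__ void sum_chunks_ldg(const int* __restrict__ data,
                               int* __restrict__ out, int n_chunks)
{
    int chunk = blockIdx.x * blockDim.x + threadIdx.x;
    if (chunk >= n_chunks) return;
    int s = 0;
    for (int i = 0; i < 8; ++i)
        s += __ldg(&data[chunk * 8 + i]);
    out[chunk] = s;
}

// Hypothetical host helper; error checking is omitted to keep the sketch short.
std::vector<int> run_sum_chunks_ldg(const std::vector<int>& in)
{
    int n_chunks = static_cast<int>(in.size() / 8);
    std::vector<int> out(n_chunks, 0);
    if (n_chunks == 0) return out;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n_chunks * 8 * sizeof(int));
    cudaMalloc(&d_out, n_chunks * sizeof(int));
    cudaMemcpy(d_in, in.data(), n_chunks * 8 * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 64;  // block size mentioned later in the thread
    int blocks = (n_chunks + threads - 1) / threads;
    sum_chunks_ldg<<<blocks, threads>>>(d_in, d_out, n_chunks);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, n_chunks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Comparing this against a plain `data[...]` version in the profiler would show whether the read-only path changes the number of transactions for a given chunk size.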

Again, depending on your GPU, you can also disable L1 (e.g. Fermi GPUs). But this might be cutting off your nose to spite your face if you did it program-wide. It’s possible using PTX to control L1 caching in Fermi, I believe, on a load-by-load basis. Again, it may still not give you any perf improvement. Kepler GPUs have their L1 disabled already for global loads, so this should not be an issue on Kepler (cc3.0/cc3.5) GPUs.
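For illustration, here is a sketch of both options under stated assumptions. Program-wide, the usual route on Fermi is the compiler option `-Xptxas -dlcm=cg`, which makes global loads cache in L2 only. Per load, the PTX `ld.global.cg` instruction can be wrapped in inline assembly; the wrapper `load_cg` and the copy kernel below are hypothetical names, and the code assumes a 64-bit build and a pointer that really refers to global memory.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical wrapper: load one int with the .cg cache operator
// (cache in L2, bypass L1). Assumes p points to global memory and
// that pointers are 64-bit (the "l" constraint).
__device__ __forceinline__ int load_cg(const int* p)
{
    int v;
    asm volatile("ld.global.cg.s32 %0, [%1];" : "=r"(v) : "l"(p));
    return v;
}

// Hypothetical kernel: copies n ints, reading each one through load_cg.
__global__ void copy_cg(const int* in, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = load_cg(&in[i]);
}

// Hypothetical host helper; error checking is omitted to keep the sketch short.
std::vector<int> run_copy_cg(const std::vector<int>& in)
{
    int n = static_cast<int>(in.size());
    std::vector<int> out(n, 0);
    if (n == 0) return out;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 64;
    int blocks = (n + threads - 1) / threads;
    copy_cg<<<blocks, threads>>>(d_in, d_out, n);

    cudaMemcpy(out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

The per-load form leaves every other load in the program on its default caching behaviour, which is the point of doing it load by load rather than with the compiler option.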

wlangdon · January 13, 2015, 3:19pm

Oh, interesting. I did not know about the difference in line widths between L1 and L2.

I’m not sure I understood your comment about textures and read-only cache.
Do they have the same cache line width as L2 (ie 32 bytes)?

Will continue investigations at this end.

Many thanks
Bill

Robert_Crovella · January 13, 2015, 3:27pm

AFAIK they do not (at least for texture cache, I don’t think it has the same cacheline size, someone else may come along and set me straight). However, a miss in either one that is also a miss in L2 will result in an L2 cacheline load at L2 load granularity (32 bytes). Therefore the effective load-from-DRAM granularity of anything backed by the L2 cache is at least the L2 cacheline width.

I find the diagram depicted in the answer here to be instructive:

http://stackoverflow.com/questions/27366359/what-is-the-difference-between-dram-read-transactions-and-gld-transactions-in-cu

wlangdon · January 13, 2015, 5:00pm

Thanks Bob,
What I was worrying about was my code reading in 8 (or 16) ints via an L1 or
texture cache, forcing a whole L1 (or read-only) cache line (32 ints) to be read,
even though it will use at most 25% (or 50%) of the data read. It sounds like
(with compute capability >= 3.0) this will not happen with global loads. Even if the number of
active blocks per SMX is modest (block size 64), there should be a reasonable
chance of the 8th (or 16th) int still being in L2 when my code requests it.

In the hope of making better use of the bandwidth between L2 and each SMX, I have tried reading
4 ints at a time. Unfortunately this gave only a modest improvement.
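A minimal sketch of what reading 4 ints at a time can look like, assuming the data is stored as consecutive 8-int chunks and the buffer is 16-byte aligned (pointers returned by `cudaMalloc` are). The names `sum_chunks_int4` and `run_sum_chunks_int4` are hypothetical and this is not the actual code being discussed. Each `int4` load fetches 16 bytes in one instruction, so an 8-int chunk takes two loads instead of eight.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread sums one 8-int chunk using two
// 16-byte int4 loads. data must be 16-byte aligned.
__global__ void sum_chunks_int4(const int4* __restrict__ data,
                                int* __restrict__ out, int n_chunks)
{
    int chunk = blockIdx.x * blockDim.x + threadIdx.x;
    if (chunk >= n_chunks) return;
    int4 a = data[chunk * 2];      // ints 0..3 of the chunk
    int4 b = data[chunk * 2 + 1];  // ints 4..7 of the chunk
    out[chunk] = a.x + a.y + a.z + a.w + b.x + b.y + b.z + b.w;
}

// Hypothetical host helper; error checking is omitted to keep the sketch short.
std::vector<int> run_sum_chunks_int4(const std::vector<int>& in)
{
    int n_chunks = static_cast<int>(in.size() / 8);
    std::vector<int> out(n_chunks, 0);
    if (n_chunks == 0) return out;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n_chunks * 8 * sizeof(int));
    cudaMalloc(&d_out, n_chunks * sizeof(int));
    cudaMemcpy(d_in, in.data(), n_chunks * 8 * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 64;
    int blocks = (n_chunks + threads - 1) / threads;
    // Reinterpret the int buffer as int4; valid because d_in is 16-byte aligned.
    sum_chunks_int4<<<blocks, threads>>>(reinterpret_cast<const int4*>(d_in),
                                         d_out, n_chunks);

    cudaMemcpy(out.data(), d_out, n_chunks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Since each chunk is exactly 32 bytes, the two loads of one thread fall within a single 32-byte span, provided the chunk starts on a 32-byte boundary.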

Thanks again
Bill
