variable cache line width?

Is it possible to change the cache line size?
I have data grouped in chunks of either 8 or 16 ints, neither of which fits well with
a 32-int (128-byte) cache line.
Is it possible to change or decrease the cache line width?

Many thanks

The L1 cache line width is fixed at 128 bytes.
The L2 cache line width is fixed at 32 bytes.

So assuming you access your data in the chunk sizes you have outlined, you should still be making efficient use of L2.

Other caches like texture, read-only, etc. do not flow through L1, but they do flow through L2, so there is a minimum fetch-from-DRAM granularity of 32 bytes that underlies most operations. Depending on your access patterns and GPU, you might want to try using the texture or read-only cache.
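If you want to experiment with the read-only path, one option on cc3.5+ devices is the `__ldg()` intrinsic, which asks the compiler to issue the load through the read-only data cache. A minimal sketch, where the kernel and array names are just placeholders:

```cuda
// Sum 8-int chunks, pulling each int through the read-only
// data cache via __ldg() (requires compute capability 3.5+).
__global__ void sum_chunks(const int *__restrict__ in, int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        int s = 0;
        for (int i = 0; i < 8; i++)
            s += __ldg(&in[tid * 8 + i]);   // load via read-only cache
        out[tid] = s;
    }
}
```

Marking the pointer `const __restrict__` also gives the compiler a chance to route the loads through the read-only cache on its own.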

Again, depending on your GPU, you can also disable L1 (e.g. Fermi GPUs). But this might be cutting off your nose to spite your face if you did it program-wide. I believe it’s possible to control L1 caching on Fermi on a load-by-load basis using PTX. Again, it may still not give you any perf improvement. Kepler GPUs already have L1 disabled for global loads, so this should not be an issue on Kepler (cc3.0/cc3.5) GPUs.
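For reference, on Fermi the global-load caching behavior can be selected program-wide at compile time with `-Xptxas -dlcm=cg` (cache in L2 only, bypassing L1) versus the default `-dlcm=ca`. Per-load control is possible with the PTX `ld.global.cg` cache operator via inline asm; a sketch, not a drop-in recommendation:

```cuda
// Program-wide: compile with  nvcc -Xptxas -dlcm=cg ...
// Per-load: ld.global.cg caches in L2 only, bypassing L1.
__device__ int load_cg(const int *p)
{
    int v;
    asm volatile("ld.global.cg.s32 %0, [%1];" : "=r"(v) : "l"(p));
    return v;
}
```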

oh interesting. I did not know of the difference in line widths between L1 and L2.

I’m not sure I understood your comment about textures and read-only cache.
Do they have the same cache line width as L2 (ie 32 bytes)?

Will continue investigations at this end.

Many thanks

AFAIK they do not (at least for the texture cache; I don’t think it has the same cache line size, but someone else may come along and set me straight). However, a miss in either one that is also a miss in L2 will result in an L2 cache line load at L2 granularity (32 bytes). Therefore the load-from-DRAM granularity of anything backed by the L2 cache is effectively the L2 cache line width, at a minimum.

I find the diagram depicted in the answer here to be instructive:

Thanks Bob,
What I was worrying about was my code reading 8 (or 16) ints via the L1 or
texture cache, forcing a whole L1 (or read-only) cache line (32 ints) to be read,
even though at most 25% (or 50%) of the data read would be used. It sounds like
(with compute capability >= 3.0) this will not happen with global loads. Even if the number of
active blocks per SMX is modest (block size 64), there should be a reasonable
chance of the 8th (or 16th) int still being in L2 when my code requests it.

In the hope of making better use of the bandwidth between L2 and each SMX, I have tried
reading 4 ints in one go. Unfortunately this gave only a modest improvement.
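For what it's worth, the usual way to read 4 ints in one instruction is a vector load through `int4`, which requires the data to be 16-byte aligned. A sketch along the lines of what you describe (names are illustrative):

```cuda
// One 16-byte vector load per thread; 'in' must be 16-byte aligned,
// which cudaMalloc'd buffers are by default.
__global__ void sum4(const int4 *in, int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        int4 v = in[tid];               // single 128-bit load
        out[tid] = v.x + v.y + v.z + v.w;
    }
}
```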

Thanks again