Where are the texture cache(s)? Can the texture cache provide a way to share localized data between blocks?
In principle: yes. It is a cache after all.
In practice: not really. My experiences and testing indicate that to maximize the performance of the cache, you really only need data locality within the threads of each individual warp.
With data locality within each warp, even semi-random access patterns can achieve 70 GiB/s.
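To make "locality within a warp" a bit more concrete, here is a rough sketch that counts how many distinct cache lines one warp touches under two access patterns. The 32-byte line size and both index patterns are assumptions made up for illustration, not documented hardware figures.

```python
THREADS_PER_WARP = 32
ELEM_BYTES = 16   # one float4 per thread
LINE_BYTES = 32   # hypothetical cache-line size, not a documented figure


def lines_touched(indices):
    """Number of distinct cache lines covered by one warp's element indices."""
    return len({i * ELEM_BYTES // LINE_BYTES for i in indices})


# Semi-random but local: threads read in a shuffled order inside a 16-element window.
local = [1000 + (t * 7) % 16 for t in range(THREADS_PER_WARP)]
# Scattered: every thread lands 1024 elements away from its neighbour.
scattered = [t * 1024 for t in range(THREADS_PER_WARP)]

print(lines_touched(local), lines_touched(scattered))  # prints: 8 32
```

The shuffled-but-local warp needs 8 lines while the scattered warp needs 32, which is the kind of difference the per-warp locality advice is about.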
Thanks. So in your experience the cache only works per warp, then? 22.214.171.124 agrees:
"…The texture cache is optimized for 2D spatial locality, so [b]threads of the same warp[/b] that read texture addresses that are close together will achieve best performance."
That makes the cache sound per-warp and not global, but I’m hoping for a cache that can span multiple blocks. Guess it doesn’t work that way!
The cache is very small (8 KB, as I recall), and mainly exists to make (bi|tri)linear interpolation fast. Even if it is global, it is so small that you probably won’t notice any sharing between blocks.
Yeah, the “per warp” is sort of a natural result of the small cache combined with the interleaved execution of warps. Each multiprocessor can run 24 warps concurrently: say each thread loads a float4 (16 bytes). 24 * 32 * 16 = 12,288 bytes, so we’ve already exceeded the 8 KB cache. Presuming the texture “cache” operates like a standard cache with some kind of replacement policy, by the time the last of the 24 concurrent warps issues its load, values from the earlier warps have already been evicted to make room for others.
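The arithmetic above can be checked with a toy model. This is only a sketch: it assumes the 8 KB size recalled earlier in the thread, an LRU replacement policy, and a 32-byte line, none of which is confirmed for the real hardware. The helper hit_rate is a made-up name for this example.

```python
from collections import OrderedDict

CACHE_BYTES = 8 * 1024    # size recalled in the thread, not a verified figure
WARPS_PER_SM = 24         # concurrent warps per multiprocessor, from the post
THREADS_PER_WARP = 32
BYTES_PER_LOAD = 16       # one float4 per thread

working_set = WARPS_PER_SM * THREADS_PER_WARP * BYTES_PER_LOAD
print(working_set, working_set > CACHE_BYTES)  # prints: 12288 True


def hit_rate(num_warps, passes=2, line_bytes=32):
    """Toy LRU model (assumed policy): each warp reads its own contiguous
    chunk, warps run round-robin, and the whole sweep repeats `passes` times."""
    capacity = CACHE_BYTES // line_bytes
    lines_per_warp = THREADS_PER_WARP * BYTES_PER_LOAD // line_bytes
    cache = OrderedDict()
    hits = accesses = 0
    for _ in range(passes):
        for warp in range(num_warps):
            for i in range(lines_per_warp):
                line = warp * lines_per_warp + i
                accesses += 1
                if line in cache:
                    hits += 1
                    cache.move_to_end(line)        # mark as most recently used
                else:
                    cache[line] = True
                    if len(cache) > capacity:
                        cache.popitem(last=False)  # evict least recently used
    return hits / accesses
```

In this model hit_rate(16) is 0.5 (everything fits, so the second sweep hits every time), while hit_rate(24) is 0.0: with 24 warps each line is evicted before any later warp could come back to it.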
In discussions with tachyon_john and others, we came to the conclusion that something like “uncoalesced memory reader” is a better name for the texture cache. Maybe “almost coalesced memory reader” would be even better, since it hints at the data locality needed in a warp’s accesses.