Performance Considerations using Texture Access

Does the performance depend on the access pattern?

The texture cache is quite small (8 KB) and shared by 3 multiprocessors, so cached data will soon be pushed out by new data.

Is there more information about how the cache really works?

How is it organized? Does the cache use “cache lines” like on the CPU? What is meant by locality, and in which directions does it apply (2D access)?

Which access pattern should I use to get the most cache hits and the best performance?

The organization depends on how you set up the texture. You can bind a texture directly to global memory for 1D locality, or to a cudaArray for 1D, 2D or 3D locality.
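To make the two setups concrete, here is a minimal sketch that exposes the same data once through linear global memory and once through a 2D cudaArray. It assumes the texture object API (`cudaCreateTextureObject`) rather than the older texture reference binding calls, which newer toolkits have deprecated. The helper names `makeLinearTexture`, `makeArrayTexture`, `readBoth` and `maxAbsDifference` are made up for this example, and error checking is left out for brevity.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Texture over plain linear global memory: addressed with one integer index.
cudaTextureObject_t makeLinearTexture(const float* devPtr, size_t count) {
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = const_cast<float*>(devPtr);
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = count * sizeof(float);
    cudaTextureDesc tex = {};
    tex.readMode = cudaReadModeElementType;
    cudaTextureObject_t obj = 0;
    cudaCreateTextureObject(&obj, &res, &tex, nullptr);
    return obj;
}

// Texture over a 2D cudaArray: addressed with (x, y) coordinates.
cudaTextureObject_t makeArrayTexture(cudaArray_t array) {
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = array;
    cudaTextureDesc tex = {};
    tex.addressMode[0] = cudaAddressModeClamp;
    tex.addressMode[1] = cudaAddressModeClamp;
    tex.filterMode = cudaFilterModePoint;       // no interpolation
    tex.readMode = cudaReadModeElementType;
    tex.normalizedCoords = 0;                   // coordinates in texels
    cudaTextureObject_t obj = 0;
    cudaCreateTextureObject(&obj, &res, &tex, nullptr);
    return obj;
}

// Reads the same element through both textures and stores the difference.
__global__ void readBoth(cudaTextureObject_t lin, cudaTextureObject_t arr,
                         int width, int height, float* diff) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    float a = tex1Dfetch<float>(lin, y * width + x);    // 1D fetch
    float b = tex2D<float>(arr, x + 0.5f, y + 0.5f);    // 2D fetch, texel center
    diff[y * width + x] = a - b;
}

// Hypothetical driver: fills both resources with the same values and
// returns the largest absolute difference seen by the kernel.
float maxAbsDifference(int width, int height) {
    size_t count = static_cast<size_t>(width) * height;
    std::vector<float> host(count);
    for (size_t i = 0; i < count; ++i) host[i] = static_cast<float>(i % 251);

    float* dLinear = nullptr;
    float* dDiff = nullptr;
    cudaMalloc(&dLinear, count * sizeof(float));
    cudaMalloc(&dDiff, count * sizeof(float));
    cudaMemcpy(dLinear, host.data(), count * sizeof(float), cudaMemcpyHostToDevice);

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t array = nullptr;
    cudaMallocArray(&array, &desc, width, height);
    cudaMemcpy2DToArray(array, 0, 0, host.data(), width * sizeof(float),
                        width * sizeof(float), height, cudaMemcpyHostToDevice);

    cudaTextureObject_t lin = makeLinearTexture(dLinear, count);
    cudaTextureObject_t arr = makeArrayTexture(array);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    readBoth<<<grid, block>>>(lin, arr, width, height, dDiff);

    std::vector<float> diff(count);
    cudaMemcpy(diff.data(), dDiff, count * sizeof(float), cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(lin);
    cudaDestroyTextureObject(arr);
    cudaFreeArray(array);
    cudaFree(dLinear);
    cudaFree(dDiff);

    float worst = 0.0f;
    for (float d : diff) worst = std::fmax(worst, std::fabs(d));
    return worst;
}
```

Calling `maxAbsDifference(64, 48)` from a host `main` is expected to return 0, since both paths read the same values. The sketch only shows the two setups; it does not measure which one caches better.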

It has been said that the new “pitch-linear” 2D texture bound to global memory still has only 1D locality. I haven’t written a microbenchmark to test that for myself yet.

Just what it says in the programming guide. The best use of the texture cache is to have spatially local accesses among the threads in each warp.
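As a rough illustration of spatially local access within a warp, here is a sketch of a 3x3 neighborhood sum read through a 2D texture. With a 16x16 thread block, each 32-thread warp covers a 16x2 patch of the image, so the threads of a warp fetch texels that sit close together in both x and y. The kernel and the helper `boxSumOfOnesIsNine` are hypothetical names for this example, the texture object API is an assumption on my part, and error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread sums its 3x3 neighborhood. Neighboring threads read
// overlapping, nearby texels; out-of-range reads are clamped to the edge.
__global__ void boxSum3x3(cudaTextureObject_t tex, int width, int height, float* out) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += tex2D<float>(tex, x + dx + 0.5f, y + dy + 0.5f);
    out[y * width + x] = sum;
}

// Hypothetical driver: runs the kernel on an image of ones and checks
// that every output equals 9 (nine texels of value 1).
bool boxSumOfOnesIsNine(int width, int height) {
    size_t count = static_cast<size_t>(width) * height;
    std::vector<float> host(count, 1.0f);

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t array = nullptr;
    cudaMallocArray(&array, &desc, width, height);
    cudaMemcpy2DToArray(array, 0, 0, host.data(), width * sizeof(float),
                        width * sizeof(float), height, cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = array;
    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;
    td.readMode = cudaReadModeElementType;
    td.normalizedCoords = 0;
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    float* dOut = nullptr;
    cudaMalloc(&dOut, count * sizeof(float));

    dim3 block(16, 16);   // one warp = a 16x2 patch of threads
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    boxSum3x3<<<grid, block>>>(tex, width, height, dOut);

    std::vector<float> result(count);
    cudaMemcpy(result.data(), dOut, count * sizeof(float), cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFreeArray(array);
    cudaFree(dOut);

    for (float v : result)
        if (v != 9.0f) return false;
    return true;
}
```

Whether a 16x16 block actually beats, say, a 256x1 block on a particular GPU is something to time rather than assume; changing `block` in the driver is enough to try other shapes.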

For longer and more explicit descriptions, search the forums for the many other posts on the texture cache by me.