texture cache and L2 cache

Hi all, I am also confused about how the texture cache connects to the L2 cache. As we know, a texture cache line is a 2D block. However, when data is not in the texture cache, the request goes to the L2 cache. So in the L2 cache, is the cache line for texture data also a 2D block? If not, how do the two caches communicate with each other? Thanks.

Looking at Nsight Memory Statistics it seems like the transaction size requested from L2 is the same size as when coming out of the texture cache (32 bytes). So I’m guessing the transactions are exactly for the same data which jibes with this statement:

“Texture memory is designed for streaming fetches with a constant latency; a texture cache hit reduces device memory bandwidth usage, but not fetch latency.”

Found here:

[url]http://docs.nvidia.com/nsight-visual-studio-edition/4.0/Nsight_Visual_Studio_Edition_User_Guide.htm#Analysis/Report/CudaExperiments/KernelLevel/MemoryStatisticsTexture.htm#Chart[/url]

The CUDA Handbook goes into a little detail on how the 2D locality might be implemented:

The CUDA Handbook: A Comprehensive Guide to GPU Programming - Nicholas Wilt (Google Books)

But I guess the key is to play with the geometry of your requests and try and raise your hit rate and lower your transactions per request.
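To make "transactions per request" concrete, here is a rough sketch that counts how many distinct 32-byte sectors one warp touches for different thread footprints. It assumes a plain row-major (pitch-linear) array, 4-byte elements, and a 4096-byte pitch; those values and the function name `sectors_per_request` are made up for illustration. The real texture layout is opaque, so the Nsight counters remain the ground truth.

```python
SECTOR_BYTES = 32  # transaction size Nsight reports, per the post above
WARP_SIZE = 32     # threads that issue one request together


def sectors_per_request(warp_width, elem_bytes=4, pitch_bytes=4096):
    """Count the distinct 32-byte sectors touched by one warp.

    Assumption: thread t reads element (x, y) = (t % warp_width, t // warp_width)
    of a row-major array whose rows start pitch_bytes apart. This models a
    pitch-linear layout only, not the hardware's internal texture tiling.
    """
    sectors = set()
    for t in range(WARP_SIZE):
        x, y = t % warp_width, t // warp_width
        byte_addr = y * pitch_bytes + x * elem_bytes
        sectors.add(byte_addr // SECTOR_BYTES)
    return len(sectors)


# Compare a few warp footprints (width x height) on the same array.
for width in (32, 16, 8, 4):
    height = WARP_SIZE // width
    print(f"{width}x{height} footprint: {sectors_per_request(width)} sectors")
```

Under these assumptions the 32x1, 16x2 and 8x4 footprints each touch 4 sectors, while 4x8 touches 8, because each 16-byte row segment still costs a whole 32-byte sector. A tiled texture layout may well behave differently, which is why measuring is the only reliable check.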

Thanks for your reply.

I also found some documents, but they describe different arrangements:

  1. texture cache and then L2 cache.
    http://www.gris.informatik.tu-darmstadt.de/projects/gpu_cache_behavior/data/13rp003-GRIS.pdf

In 4.1, it gives the L1 texture cache line size as 128 B.

  2. texture L1 cache, texture L2 cache.
    http://www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf

In H, the texture L1 cache line is given as 32 B and the texture L2 cache line as 256 B.

Are these two describing the same thing? That is, is the texture L2 cache the same L2 cache that is used for global loads?

I'm not sure why the first paper claims a 128-byte L1 cache line. The second paper seems more reliable.

So a texture fetch gets you 32 bytes at a time from the texture cache, which gets 32 bytes at a time from L2, which gets 256 bytes at a time from device memory. The actual geometry of that data is opaque, but with some tweaking you should be able to find a sweet spot.
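As a back-of-the-envelope check on those granularities, the sketch below takes the 32-byte and 256-byte figures from this thread at face value and counts the traffic a set of byte addresses would generate if every cache were cold. The function name `cold_miss_traffic`, the 4-byte element size, and the flat address model are assumptions for illustration, not a description of the actual hardware.

```python
TEX_FETCH_BYTES = 32    # texture cache <-> L2 granularity, as stated above
DRAM_FETCH_BYTES = 256  # L2 <-> device memory granularity, as stated above


def cold_miss_traffic(addresses, elem_bytes=4):
    """Estimate traffic for aligned element reads when all caches start empty.

    Assumption: each address is the start of one aligned element, so no
    element straddles a 32-byte or 256-byte boundary.
    """
    unique = set(addresses)
    sectors = {a // TEX_FETCH_BYTES for a in unique}   # 32-byte pieces requested from L2
    lines = {a // DRAM_FETCH_BYTES for a in unique}    # 256-byte pieces requested from DRAM
    return {
        "l2_transactions": len(sectors),
        "dram_fetches": len(lines),
        "dram_bytes": len(lines) * DRAM_FETCH_BYTES,
        "useful_bytes": len(unique) * elem_bytes,
    }


contiguous = [4 * i for i in range(32)]   # 32 adjacent 4-byte elements
strided = [256 * i for i in range(32)]    # one element per 256-byte line

print(cold_miss_traffic(contiguous))
print(cold_miss_traffic(strided))
```

In this model the contiguous pattern needs 4 L2 transactions and one 256-byte device memory fetch for 128 useful bytes, while the strided pattern needs 32 of each and pulls 8192 bytes from device memory for the same 128 useful bytes.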