Our app uses a 3D float texture, with 192 threads per multiprocessor (MP).
Each thread accesses unique x,y,z locations.
However, each thread's subsequent accesses are close to the x,y,z locations it accessed previously.
Assuming a texture miss results in 8 float accesses (8 directions) per thread, 192 threads will initially bring in 192 * 8 * 4 bytes = 6144 bytes, roughly 6K.
That is fairly close to 8K, the size of the texture cache…
However, if the hardware fetches more than 8 floats per texture miss, it could overflow the cache, and subsequent accesses may NOT really hit the cache…
Is the number of float access per texture-miss documented somewhere?
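As a rough sanity check on the arithmetic above, here is a minimal C++ sketch. The 192 threads, 4-byte floats and 8K cache come from the post; the function names are made up for illustration, and the model assumes the worst case, where no two threads share a fetched float.

```cpp
#include <cstddef>

// Figures taken from the post above.
constexpr std::size_t kThreadsPerMP  = 192;       // threads per multiprocessor
constexpr std::size_t kBytesPerFloat = 4;         // 32-bit float texels
constexpr std::size_t kTexCacheBytes = 8 * 1024;  // texture cache size

// Bytes brought in if every thread misses once and each miss fetches
// `floats_per_miss` floats, with no sharing between threads (assumption).
constexpr std::size_t footprint_bytes(std::size_t floats_per_miss) {
    return kThreadsPerMP * floats_per_miss * kBytesPerFloat;
}

// Largest floats-per-miss count whose worst-case footprint still fits.
constexpr std::size_t max_floats_per_miss() {
    return kTexCacheBytes / (kThreadsPerMP * kBytesPerFloat);
}

// Compile-time checks of the numbers discussed in the thread.
static_assert(footprint_bytes(8) == 6144, "8 floats per miss is about 6K");
static_assert(max_floats_per_miss() == 10, "10 floats per miss still fit");
static_assert(footprint_bytes(11) > kTexCacheBytes, "11 floats per miss overflow");
```

Under these assumptions, 8 floats per miss gives 6144 bytes, and the footprint stays within 8K up to 10 floats per miss; at 11 it is 8448 bytes and no longer fits. The actual number of floats fetched per miss is the open question here.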
I remember reading here that the locality exploitable with texture fetches is more of a warp-wide locality, where the threads of a given warp fetch data within some radius, rather than a temporal locality, where one thread keeps accessing the same radius and another thread keeps accessing a different one.
As you have pointed out, by the time the same thread executes again, the texture cache may well have been flushed as far as this thread is concerned.
If my memory does not serve me right, someone else will pitch in!
Thanks for answering. You have brought in some good points.
Let me share my views on this.
If it is the warp-wide locality that matters, what is the need for a cache at all?
The warp aspect must be tied to the coalesced memory accesses that fetch the data into the cache… I agree that performance would improve if all threads in the warp access nearby elements, since fewer coalesced transactions to memory are needed to bring in the data… Warp-wide locality matters… but it is not the only thing that matters…
However, I am more concerned about the cache usage; you have understood that point correctly. I wish someone could shed some light here…
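To put a rough number on the warp-wide locality point, here is a small C++ sketch that counts how many distinct texels a warp touches. The layout is purely hypothetical: 32 threads, each reading a 2x2x2 block (the 8 floats from the original question), all at the same y and z, with thread t starting at x = t * stride. The function name and the layout are assumptions for illustration, not a description of what the hardware does.

```cpp
// Hypothetical model: each thread of a 32-thread warp reads a 2x2x2 block of
// texels. All threads share the same y and z, and thread t starts at
// x = t * stride (stride >= 0). This layout is an assumption.
constexpr int kWarpSize = 32;
constexpr int kTexelsPerThread = 8;

// Counts the distinct texels touched by the whole warp.
constexpr int unique_texels_in_warp(int stride) {
    int unique_x = 0;    // distinct x indices seen so far
    int highest_x = -1;  // largest x index already counted
    for (int t = 0; t < kWarpSize; ++t) {
        // Thread t needs x indices t*stride and t*stride + 1.
        for (int x = t * stride; x <= t * stride + 1; ++x) {
            if (x > highest_x) {
                ++unique_x;
                highest_x = x;
            }
        }
    }
    return unique_x * 2 * 2;  // two texels each along y and z
}

static_assert(kWarpSize * kTexelsPerThread == 256, "texel reads per warp");
static_assert(unique_texels_in_warp(1) == 132, "neighbouring threads share texels");
static_assert(unique_texels_in_warp(2) == 256, "no sharing once threads are 2 apart");
```

In this model, a warp of neighbouring threads issues 256 texel reads but only touches 132 distinct texels, while threads spaced two texels apart share nothing. That sharing happens within a single warp, which is separate from the question of whether data survives in the cache until the same thread runs again.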
Even though the texture cache is small, it can still provide a significant speedup. For example, suppose you draw a magnified texture with linear interpolation: in this case it is very likely that a warp requires the same texels for linear interpolation as the previous warp. Only one of the two warps then needs to fetch the required data into the texture cache, while the second warp waits for that transfer to complete before reading from the cache. This reduces the DRAM bus demand. As mentioned in the CUDA Programming Guide, the texture cache does not reduce fetch latency but reduces DRAM demand, so the DRAM bus is less likely to be saturated, meaning that other warps and/or blocks can already start fetching their data instead of waiting for the DRAM bus to become unsaturated.
Don’t take my word for it though, it’s just my interpretation of texture cache :)
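The magnified-texture example can be sketched in a few lines of C++. This assumes, hypothetically, that each warp handles one output row, that rows sample at their pixel centres, and that linear filtering picks its lower texel as floor(y - 0.5), which is how I read the texture fetching description in the CUDA Programming Guide. All function names are made up for illustration.

```cpp
// Floor division for a positive divisor (C++ '/' truncates toward zero).
constexpr int floor_div(int a, int b) {
    return a / b - ((a % b != 0 && a < 0) ? 1 : 0);
}

// Lower texel row used by output row `row` when sampling at its pixel centre:
// y = (row + 0.5) / magnification, lower row = floor(y - 0.5).
// In integers: floor((2*row + 1 - magnification) / (2*magnification)).
constexpr int lower_texel_row(int row, int magnification) {
    return floor_div(2 * row + 1 - magnification, 2 * magnification);
}

// Number of output rows in [1, rows) that interpolate between exactly the
// same two texel rows as the output row before them.
constexpr int rows_reusing_previous(int rows, int magnification) {
    int reused = 0;
    for (int r = 1; r < rows; ++r) {
        if (lower_texel_row(r, magnification) ==
            lower_texel_row(r - 1, magnification)) {
            ++reused;
        }
    }
    return reused;
}

static_assert(rows_reusing_previous(16, 4) == 11, "4x magnification");
static_assert(rows_reusing_previous(16, 1) == 0, "no magnification");
```

With 4x magnification, 11 of the 15 row-to-row steps land on the same pair of texel rows, so under the one-warp-per-row assumption most warps could be served by data the previous warp already requested. Without magnification, no two adjacent rows use an identical pair.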