Texture access... Fetch size, Cache size & performance

Sarnath · December 23, 2009, 12:32pm

Our app uses a 3D float texture, with 192 threads per MP.
Each thread accesses unique x,y,z locations.
However, each subsequent access is close to the x,y,z location previously accessed by the same thread.

Assuming a texture miss results in 8 float accesses (8 directions) per thread, 192 threads will initially bring in 192 * 8 * 4 = 6144 bytes, about 6K.
That is already close to 8K, the size of the texture cache…
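The estimate above can be written out as a quick sanity check. This is a sketch that takes the post's own assumptions at face value (8 floats per thread per miss, 192 resident threads, an 8 KB cache); none of these numbers come from NVIDIA documentation, and `cache_footprint_bytes` is just a hypothetical helper name.

```python
# Back-of-envelope texture-cache footprint, using the assumptions from the post:
# 192 threads per multiprocessor, 8 floats fetched per thread on a miss,
# and an 8 KB texture cache (assumed, not taken from NVIDIA docs).
FLOAT_BYTES = 4
CACHE_BYTES = 8 * 1024


def cache_footprint_bytes(threads: int, floats_per_miss: int,
                          bytes_per_float: int = FLOAT_BYTES) -> int:
    """Bytes pulled into the cache if every thread misses exactly once."""
    return threads * floats_per_miss * bytes_per_float


footprint = cache_footprint_bytes(threads=192, floats_per_miss=8)
# prints: 6144 bytes = 75% of the cache
print(footprint, f"bytes = {footprint / CACHE_BYTES:.0%} of the cache")
```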

However, if the hardware fetches more than 8 floats per texture miss, it could overflow the cache, and subsequent accesses may NOT really be cachy-cachey…
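Since the per-miss fetch size is exactly what is undocumented here, one way to frame the question is to ask how large it can get before 192 threads' worth of misses no longer fits. This is a sketch under the same assumed numbers as the post; the larger miss sizes tried at the end are hypothetical, not measured values.

```python
# How many floats can each thread's miss pull in before the combined
# footprint of all resident threads exceeds the cache?
# Assumed numbers from the thread: 192 threads, 8 KB cache, 4-byte floats.
CACHE_BYTES = 8 * 1024
THREADS = 192
FLOAT_BYTES = 4


def max_floats_per_miss(cache_bytes: int = CACHE_BYTES,
                        threads: int = THREADS,
                        bytes_per_float: int = FLOAT_BYTES) -> int:
    """Largest per-thread miss size (in floats) that still fits in the cache."""
    return cache_bytes // (threads * bytes_per_float)


def overflows(floats_per_miss: int) -> bool:
    """True if one miss per thread would already exceed the cache capacity."""
    return THREADS * floats_per_miss * FLOAT_BYTES > CACHE_BYTES


print(max_floats_per_miss())  # 10
for n in (8, 16, 64):  # hypothetical per-miss fetch sizes
    print(n, "overflows" if overflows(n) else "fits")
```

Under these assumptions the headroom is only about 2 floats per thread, so even modest over-fetching on a miss would push the working set past the cache.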

Is the number of floats fetched per texture miss documented somewhere?

Ailleur · December 23, 2009, 2:30pm

I remember reading here that the locality exploited by texture fetches is more of a warp-wide locality, where the threads of a given warp fetch data within some radius, rather than temporal locality, where a given thread keeps accessing the same region while another thread accesses a different one.
As you have pointed out, by the time the same thread executes again, the texture cache may well have been flushed as far as this thread is concerned.
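The distinction between warp-wide spatial locality and per-thread temporal locality can be illustrated with a toy cache model. This is only a sketch: it assumes a fully associative LRU cache of 128 lines and round-robin interleaving of 192 threads, whereas the real texture cache's associativity, line size and replacement policy are not documented in this thread. `LRUCache`, `per_thread_reuse` and `warp_shared` are hypothetical names.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative LRU cache that counts hits and misses by line ID.
    (Hypothetical model: real texture-cache organisation is not documented here.)"""

    def __init__(self, num_lines: int) -> None:
        self.num_lines = num_lines
        self.lines = OrderedDict()
        self.hits = 0
        self.misses = 0

    def access(self, line: int) -> None:
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.lines[line] = None
            if len(self.lines) > self.num_lines:
                self.lines.popitem(last=False)  # evict least recently used


def per_thread_reuse(threads: int, rounds: int, cache_lines: int) -> LRUCache:
    """Temporal locality only: every thread re-reads its own private line each
    round, but all threads are interleaved before any thread comes back."""
    cache = LRUCache(cache_lines)
    for _ in range(rounds):
        for t in range(threads):
            cache.access(t)
    return cache


def warp_shared(threads: int, rounds: int, threads_per_line: int,
                cache_lines: int) -> LRUCache:
    """Spatial locality only: neighbouring threads read the same line, and each
    round moves on to fresh lines (no reuse across rounds)."""
    cache = LRUCache(cache_lines)
    lines_per_round = threads // threads_per_line
    for r in range(rounds):
        for t in range(threads):
            cache.access(r * lines_per_round + t // threads_per_line)
    return cache


a = per_thread_reuse(threads=192, rounds=4, cache_lines=128)
b = warp_shared(threads=192, rounds=4, threads_per_line=8, cache_lines=128)
print("per-thread reuse:", a.hits, "hits /", a.misses, "misses")   # 0 / 768
print("warp-wide sharing:", b.hits, "hits /", b.misses, "misses")  # 672 / 96
```

With 192 threads cycling through only 128 lines, LRU evicts every thread's line before that thread returns, so per-thread reuse never hits. Giving the same pattern 256 lines (more than one per thread) turns every round after the first into hits.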

If my memory does not serve me right, someone else will pitch in!

Sarnath · December 24, 2009, 4:59am

Thanks for answering. You have brought in some good points.

Let me share my views on this.

If it is warp-wide locality that matters, why is there a cache at all???

The warp-wide aspect must be related to the coalesced memory accesses that fetch the data into the cache… I agree that performance improves if all threads in a warp access nearby elements, since fewer coalesced transactions to memory are then needed to bring in the data… Warp-wide locality matters… but it is not the only thing that matters…
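The coalescing point can be made concrete by counting how many distinct memory lines one warp's 32 requests fall into. The 32-byte line size below is a hypothetical placeholder; the actual granularity depends on the GPU and is not stated in this thread, and `lines_touched` is an illustrative name.

```python
# Count the distinct fixed-size lines (segments) that one warp's requests
# cover. Fewer lines means fewer memory transactions for the same data.
# The 32-byte line size is a hypothetical placeholder, not a documented value.
FLOAT_BYTES = 4
WARP = 32


def lines_touched(byte_addresses, line_bytes: int = 32) -> int:
    """Number of distinct lines covered by a set of byte addresses."""
    return len({addr // line_bytes for addr in byte_addresses})


nearby = [i * FLOAT_BYTES for i in range(WARP)]   # 32 consecutive floats
scattered = [i * 4096 for i in range(WARP)]       # one float every 4 KB
print(lines_touched(nearby), lines_touched(scattered))  # 4 32
```

This captures only the warp-wide benefit; it says nothing about whether those lines are still resident the next time the warp comes around, which is the cache-capacity question.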

However, I am more concerned about cache usage, and you have got this point right. I wish someone could shed some light here…

Thanks

Nico · December 29, 2009, 3:06pm

Even though the texture cache is small, it can still provide a significant speedup. For example, suppose you draw a magnified texture with linear interpolation: it is very likely that a warp needs the same texels for interpolation as the previous warp. Only one of the two warps then has to fetch the required data into the texture cache, while the second warp waits for that transfer to complete and then reads from the cache. This reduces DRAM bus demand.

As mentioned in the CUDA Programming Guide, the texture cache does not reduce fetch latency but does reduce DRAM demand, so the DRAM bus is less likely to be saturated. That means other warps and/or blocks can start fetching their data instead of waiting for the DRAM bus to free up.
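The magnification example can be checked with a toy count of texels. The sketch assumes one thread per output pixel, each warp covering 32 consecutive pixels of one output row, 4x magnification, and the usual linear-filtering rule of reading the two nearest texels per axis after a half-texel offset. It ignores cache lines and eviction, and the function names are hypothetical.

```python
# Toy count of texels needed by consecutive warps when a texture is magnified
# with linear filtering. Assumptions (hypothetical): one thread per output
# pixel, one warp = 32 consecutive pixels of one output row, 4x magnification.


def texel_index(pixel: int, magnification: int) -> int:
    """Lower texel of the linear-filter pair for an output pixel centre.
    Texture coordinate is (pixel + 0.5) / magnification; filtering reads
    floor(coord - 0.5) and that index + 1. Integer form avoids rounding."""
    return (2 * pixel + 1 - magnification) // (2 * magnification)


def texels_for_warp(row: int, first_col: int, warp_size: int,
                    magnification: int) -> set:
    """Distinct (x, y) texels one warp needs for its row of output pixels."""
    y0 = texel_index(row, magnification)
    needed = set()
    for col in range(first_col, first_col + warp_size):
        x0 = texel_index(col, magnification)
        for dy in (0, 1):
            for dx in (0, 1):
                needed.add((x0 + dx, y0 + dy))
    return needed


MAG, WARP, BASE = 4, 32, 64  # BASE keeps the example away from texture edges
per_warp = [texels_for_warp(BASE + w, BASE, WARP, MAG) for w in range(8)]
fetched_without_sharing = sum(len(s) for s in per_warp)  # each warp fetches its own
fetched_with_cache = len(set().union(*per_warp))          # shared via the cache
print(fetched_without_sharing, fetched_with_cache)  # 160 40
```

In this setup each warp needs 20 distinct texels, but the warps for eight consecutive output rows together touch only 40, so if the cache retains them, about three quarters of the texel fetches from DRAM are avoided.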

Don’t take my word for it though, it’s just my interpretation of texture cache :)

N.
