Texture cache architecture: line size of texture cache

I was looking at binding a chunk of linear memory to a texture reference in order to take advantage of the texture cache. The kernel I am writing will exhibit some 1D spatial locality as well as temporal locality, so hopefully the texture cache will suit it nicely.
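
For reference, the setup I have in mind is roughly the following (just a minimal sketch with placeholder names, using the texture reference API):

```
#include <cuda_runtime.h>

// Texture reference bound to plain linear device memory (no cudaArray).
texture<float, 1, cudaReadModeElementType> texRef;

__global__ void readKernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texRef, i);   // cached read through the texture unit
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    // Bind the linear buffer to the texture reference.
    cudaBindTexture(0, texRef, d_in, n * sizeof(float));

    readKernel<<<n / 256, 256>>>(d_out, n);

    cudaUnbindTexture(texRef);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```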

However, I am confused about the architecture of the texture cache, which I can't really find any specifics on. Exactly what data will a cache miss bring into the texture cache, and how much, assuming the texture reference is bound to linear memory? That is, will it bring in a chunk of contiguous linear memory, and if so, how large a chunk?

If this depends on the GPU architecture: the kernel will most likely be running on either a Tesla C870 or a 9800 GTX.

Thanks for any clarifications!

In my extensive testing, I have found that you get the most out of the texture cache with spatially local accesses within each warp. Temporal locality matters not at all, because the large number of other warps running on that multiprocessor will cause the first values to be flushed from the cache before the first warp gets to its next read.
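
To make "spatially local within a warp" concrete, here is the kind of contrast I mean (an illustrative sketch, not my actual benchmark code):

```
texture<float, 1, cudaReadModeElementType> texRef;

// Good: consecutive threads fetch consecutive texels, so one warp's 32
// fetches land in a small contiguous window of the texture.
__global__ void localReads(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texRef, i);
}

// Poor: each thread strides far away from its neighbors, so one warp's
// fetches are scattered across many cache lines and the cache cannot help.
__global__ void scatteredReads(float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texRef, (i * stride) % n);
}
```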

MisterAnderson or Others,

I am trying to figure out how local texture accesses have to be to benefit from the cache, and I am looking for some guidance.

I am considering using a texture fetch on a 2D array of float4s in order to access, within one warp, four float4s on the same row with a stride of 16 columns. (I'm using a 2D array for what is really 1D access, for reasons that aren't relevant here.) Any idea whether such an access pattern would have few cache misses? What if I used a 1D texture fetch instead? See the sketch below for the indexing I have in mind.
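
In code, the pattern looks roughly like this (my own sketch with made-up names, just to describe the indexing):

```
texture<float4, 2, cudaReadModeElementType> texRef2D;

// Within one warp, the 32 threads fall into four groups of eight; each group
// reads the same row at columns base, base+16, base+32, base+48.
__global__ void fetchRowStrided(float4 *out, int row, int base)
{
    int lane = threadIdx.x & 31;           // lane within the warp
    int col  = base + (lane >> 3) * 16;    // four columns, 16 apart
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    out[i]   = tex2D(texRef2D, col + 0.5f, row + 0.5f);
}
```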

It seems like it would be difficult to give a hard answer but if you could help me think about how to approximate this, it would be really helpful.

Thanks,
Danko

If I had to guess, I'd say that pattern is probably close enough to get some benefit from the texture cache. How much benefit? I don't know. It would only take a few minutes to write a quick microbenchmark to measure the memory bandwidth you get with that access pattern.
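
Something along these lines would do the job; it's a bare-bones skeleton (made-up names, array fill and cleanup omitted), so swap in your exact indexing:

```
#include <cstdio>
#include <cuda_runtime.h>

texture<float4, 2, cudaReadModeElementType> texRef2D;

__global__ void benchKernel(float4 *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = tex2D(texRef2D, x + 0.5f, y + 0.5f);  // replace with your pattern
}

int main()
{
    const int width = 1024, height = 1024, iters = 100;

    // Allocate a cudaArray and bind the 2D texture reference to it.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();
    cudaArray *arr;
    cudaMallocArray(&arr, &desc, width, height);
    cudaBindTextureToArray(texRef2D, arr);

    float4 *d_out;
    cudaMalloc(&d_out, width * height * sizeof(float4));

    dim3 block(16, 16), grid(width / 16, height / 16);

    // Time repeated launches with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        benchKernel<<<grid, block>>>(d_out, width, height);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double bytes = (double)iters * width * height * sizeof(float4) * 2;  // read + write
    printf("Effective bandwidth: %.1f GB/s\n", bytes / (ms * 1e6));
    return 0;
}
```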

Since you are reading across rows, I wouldn't expect a performance difference between the 1D and 2D fetches (besides the extra cudaArray setup time needed for the tex2D texture).