Relevance of tex2D() on Fermi
Tex instructions are less important on Fermi, but are they obsolete?

Browsing through topics related to the new cache hierarchy in Fermi, I get the impression that 1D texture fetching with tex1Dfetch() is now obsolete, and that tex1D() is mostly obsolete when texture filtering is not required. I'm wondering: is the same true for tex2D() fetches, which gain 2D spatial locality through the texture cache?
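For concreteness, here is a minimal sketch of the two access paths being compared. The texture reference and kernel names are made up for illustration, and the texture is assumed to be bound on the host (e.g. with cudaBindTextureToArray) beforehand:

texture<float, 2, cudaReadModeElementType> img2D;

// Read through the texture path: served by the texture cache,
// which is optimized for 2D spatial locality.
__global__ void copyTex(float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = tex2D(img2D, x + 0.5f, y + 0.5f);  // +0.5 centers on the texel
}

// The same read as a plain load: on Fermi this goes through L1/L2.
__global__ void copyL1(const float *in, float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = in[y * w + x];
}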

Relevant to this is the fact that NVIDIA did not simply remove the texture cache and redefine the tex instructions as regular reads through the new cache hierarchy plus filtering.

How do you know that they didn’t?

Certainly one thing that needs to stick around is whatever hardware does the mapping of real coordinates to the special space-filling curve used in the cudaArray for 2D and 3D textures. That data layout plays a key role in speeding up accesses with 2D locality.
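To make that concrete, here is a minimal host-side sketch, assuming hostData, width, and height are already defined. The key point is that the copy into the cudaArray goes through the driver, which swizzles the row-major host data into the array's opaque internal layout:

texture<float, 2, cudaReadModeElementType> img2D;

cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaArray *arr;
cudaMallocArray(&arr, &desc, width, height);

// The driver rearranges the data into the array's internal
// space-filling layout during this copy.
cudaMemcpy2DToArray(arr, 0, 0, hostData,
                    width * sizeof(float),          // source pitch
                    width * sizeof(float), height,  // extent in bytes x rows
                    cudaMemcpyHostToDevice);

cudaBindTextureToArray(img2D, arr, desc);
// Kernels now read via tex2D(img2D, x, y); the layout is never exposed.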

Yes, though I'm thinking that might be outweighed by the L2 cache being so much larger than the texture cache: 8 KB of per-SM texture cache vs. 768 KB of chip-wide L2.

I just realized that the new Surface functionality in Fermi seems to indicate that NVIDIA is still backing the concept of a separate texture cache. (That is, maybe it's not just a relic that will disappear in Compute Capability 3.0.)
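For reference, a minimal sketch of that Surface path on Fermi (CUDA 3.x surface references); the names are illustrative, and the cudaArray must be created with the cudaArraySurfaceLoadStore flag:

surface<void, 2> surfRef;

__global__ void scaleInPlace(int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h) {
        float v;
        surf2Dread(&v, surfRef, x * sizeof(float), y);        // x is byte-addressed
        surf2Dwrite(2.0f * v, surfRef, x * sizeof(float), y); // in-place write
    }
}

// Host side:
//   cudaMallocArray(&arr, &desc, w, h, cudaArraySurfaceLoadStore);
//   cudaBindSurfaceToArray(surfRef, arr);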

Not even tex1Dfetch is obsolete. It has a smaller cache line size than L1, and there are probably other differences (undocumented, of course, probably because they exist for graphics rendering). I replaced tex1Dfetch with plain L1 reads in 9 out of 10 kernels in hoomd without any performance difference (actually, a tiny bit faster because of less texture-bind overhead). The kernels where L1 works beautifully are the ones whose memory reads are "almost coalesced" (e.g., contiguous float3 reads), and the ones with lots of temporal locality (re-use of data from small arrays).
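The substitution itself is mechanical; a sketch of the before/after (kernel names hypothetical, not the actual hoomd code):

texture<float, 1, cudaReadModeElementType> dataTex;

// Before: read through the texture path (requires cudaBindTexture on the host).
__global__ void gatherTex(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(dataTex, i);
}

// After: a plain load served by L1 on Fermi. No bind call is needed,
// which is where the small win from less bind overhead comes from.
__global__ void gatherL1(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}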

tex1Dfetch beats the L1 cache by 50%+ for semi-random reads, though (at least in the semi-random pattern that hoomd accesses memory with). One issue is the 128-byte L1 cache line size: if a warp's random reads intersect 5+ 128-byte cache lines, you're wasting a lot of L2->L1 bandwidth. It has also been documented on these forums via microbenchmarking that the cost of a tex1Dfetch grows linearly with the number of cache lines the warp's reads intersect, even when every single read is a cache hit. Even assuming that the tex cache behaves the same way, its smaller line size is a huge advantage.
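To see where the wasted bandwidth comes from, here is a back-of-the-envelope host-side helper (purely illustrative, not part of hoomd): with 128-byte lines, a warp of 32 threads gathering random 4-byte words can pull in up to 32 lines, i.e. 4 KB of traffic to deliver 128 useful bytes.

#include <cstdio>
#include <set>

// Count the distinct 128-byte cache lines touched by one warp's
// 4-byte gather at the given element indices.
int linesTouched(const int *idx, int warpSize)
{
    std::set<long> lines;
    for (int t = 0; t < warpSize; ++t)
        lines.insert((long)idx[t] * 4 / 128);
    return (int)lines.size();
}

int main()
{
    int coalesced[32], scattered[32];
    for (int t = 0; t < 32; ++t) {
        coalesced[t] = t;        // contiguous reads: 1 line
        scattered[t] = t * 997;  // scattered reads: up to 32 lines
    }
    printf("coalesced: %d line(s)\n", linesTouched(coalesced, 32));
    printf("scattered: %d line(s)\n", linesTouched(scattered, 32));
    return 0;
}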

Thanks! Some of that should be copied verbatim into NVIDIA’s Fermi tuning guide :)