Relevance of tex2D() on Fermi

Tex instructions are less important on Fermi, but are they obsolete?

Browsing through topics on the new cache hierarchy in Fermi, I get the impression that 1D texture fetching with tex1Dfetch() is now obsolete, and that tex1D() is mostly obsolete when texture filtering is not required. I’m wondering whether the same is true for tex2D() fetches, which benefit from 2D spatial locality through the texture cache.
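To make the comparison concrete, here is a minimal sketch (hypothetical names, legacy texture-reference API of that era) of the same 2D stencil read done through the texture path and through plain global loads:

```cuda
// Sketch only: fieldTex/blurTex/blurGlobal are made-up names for illustration.
texture<float, 2, cudaReadModeElementType> fieldTex;  // bound to a cudaArray

__global__ void blurTex(float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    // tex2D goes through the texture cache, which is optimized for 2D
    // spatial locality; clamp addressing also handles the borders for us.
    float c = tex2D(fieldTex, x, y);
    float n = tex2D(fieldTex, x, y - 1);
    float s = tex2D(fieldTex, x, y + 1);
    out[y * w + x] = 0.5f * c + 0.25f * (n + s);
}

__global__ void blurGlobal(const float *in, float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    // Same reads through L1/L2: the row reads coalesce nicely, but the
    // vertical neighbors rely on L1 keeping adjacent rows resident,
    // and the borders must be clamped by hand.
    float c = in[y * w + x];
    float n = in[max(y - 1, 0) * w + x];
    float s = in[min(y + 1, h - 1) * w + x];
    out[y * w + x] = 0.5f * c + 0.25f * (n + s);
}
```

The question is whether, on Fermi, blurGlobal now matches blurTex once L1/L2 is in play.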

Relevant to this is the fact that NVIDIA did not simply remove the texture cache and redefine the tex instructions as regular reads through the new cache hierarchy plus filtering.

How do you know that they didn’t?

http://forums.nvidia.com/index.php?showtopic=176567&st=0&p=1101367&hl=fermi%20tex2d&fromsearch=1&#entry1101367

Certainly one thing that needs to stick around is whatever hardware does the mapping of real coordinates to the special space-filling curve used in the cudaArray for 2D and 3D textures. That data layout plays a key role in speeding up accesses with 2D locality.
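The actual cudaArray layout is undocumented, but a Z-order (Morton) curve is the textbook example of such a space-filling layout and illustrates the point: interleaving the bits of x and y keeps 2D neighbors close in linear memory.

```cuda
// Illustration only: the real cudaArray layout is undocumented, and these
// helper names are made up.  Morton order interleaves the bits of x and y.
__host__ __device__ unsigned int part1By1(unsigned int v)
{
    // Spread the lower 16 bits of v so a zero bit follows each one.
    v &= 0x0000ffffu;
    v = (v | (v << 8)) & 0x00ff00ffu;
    v = (v | (v << 4)) & 0x0f0f0f0fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

__host__ __device__ unsigned int mortonIndex(unsigned int x, unsigned int y)
{
    return part1By1(x) | (part1By1(y) << 1);
}
// mortonIndex(2, 3) == 14: the whole 2x2 block containing (2,3) occupies
// addresses 12..15, so a 2D-local access pattern touches few cache lines,
// unlike row-major layout where vertical neighbors are a full pitch apart.
```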

Yes, though I’m thinking that might be outweighed by the L2 cache being so much larger than the texture cache: 8 KB vs. 768 KB.

I just realized, the new Surface functionality in Fermi seems to indicate that NVIDIA is still backing the concept of a separate texture cache. (That is, maybe it’s not just a relic that will disappear in Compute Capability 3.0).

Not even tex1Dfetch is obsolete. It has a smaller cache line size than L1, and there are probably other differences (undocumented, of course, probably because they exist for graphics rendering). I replaced tex1Dfetch with plain L1 reads in 9 out of 10 kernels in hoomd without any performance difference (actually, they got a tiny bit faster because of reduced texture-bind overhead). The kernels where L1 works beautifully are the ones whose memory reads are “almost coalesced” (e.g. contiguous float3 reads) and the ones with lots of temporal locality, re-using data from small arrays.
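The swap described above is mechanical; a hedged sketch with hypothetical names:

```cuda
// Hypothetical gather kernel, before and after dropping tex1Dfetch.
texture<float4, 1, cudaReadModeElementType> posTex;  // needs cudaBindTexture before launch

__global__ void gatherTex(float4 *out, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(posTex, idx[i]);  // goes through the texture cache
}

__global__ void gatherL1(float4 *out, const float4 *pos, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = pos[idx[i]];  // plain global read, cached in L1; no bind overhead
}
```

When idx[] is nearly sorted (the “almost coalesced” case), gatherL1 is the version that ran slightly faster for me.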

tex1Dfetch still beats the L1 cache by 50%+ on semi-random reads, though (at least with the semi-random pattern that hoomd accesses memory with). Issue one is the 128-byte L1 cache line size: if a warp’s random reads intersect 5+ 128-byte cache lines, a lot of L2->L1 bandwidth is wasted. It has also been documented on these forums via microbenchmarking that the cost of a tex1Dfetch read increases linearly with the number of cache lines the read intersects, even when every single read is a cache hit. Even assuming the texture cache behaves the same way, its smaller line size is a huge advantage.
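A back-of-the-envelope illustration of the line-size point (the 32-byte texture-cache line is an assumption; the real size is undocumented):

```cuda
// Worst case: a warp of 32 threads each doing one random 16-byte (float4)
// read, every read landing on a distinct cache line.
//
//   useful bytes = 32 * 16  =  512
//   L1 traffic   = 32 * 128 = 4096 bytes -> 512/4096 = 12.5% efficiency
//   tex traffic  = 32 * 32  = 1024 bytes -> 512/1024 = 50%   efficiency
//
// (Assumed numbers: 128-byte L1 lines per the programming guide, ~32-byte
// texture-cache lines per forum microbenchmarks.)  The smaller the line,
// the less bandwidth a scattered read wastes.
```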

Thanks! Some of that should be copied verbatim into NVIDIA’s Fermi tuning guide :)