tex1Dfetch is not obsolete, either. It has a smaller cache line size than L1, and there are probably other differences (undocumented, of course, probably because they exist for graphics rendering). I replaced tex1Dfetch with plain L1 reads in 9 of 10 kernels in HOOMD without any performance difference (actually a tiny bit faster, because of less texture bind overhead). The kernels where L1 works beautifully are the ones where memory reads are "almost coalesced" (e.g. contiguous float3 reads), and the ones with lots of temporal locality, i.e. re-use of data from small arrays.
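For reference, here is a minimal sketch of that swap. The kernel and variable names are made up for illustration, not HOOMD's actual code:

```cuda
// Old path: array bound to a texture reference, read via tex1Dfetch.
texture<float4, 1, cudaReadModeElementType> pos_tex;

__global__ void scale_tex(float4 *out, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 p = tex1Dfetch(pos_tex, i);     // goes through the texture cache
    out[i] = make_float4(s * p.x, s * p.y, s * p.z, p.w);
}

// New path: plain pointer read. On Fermi-class hardware this is cached in
// L1 when globals are compiled with -Xptxas -dlcm=ca (the default).
__global__ void scale_l1(const float4 *pos, float4 *out, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 p = pos[i];                     // coalesced read through L1
    out[i] = make_float4(s * p.x, s * p.y, s * p.z, p.w);
}
```

The L1 version also drops the cudaBindTexture call on the host before each launch, which is where the texture bind overhead mentioned above goes away.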
tex1Dfetch still beats the L1 cache by 50%+ when it comes to semi-random reads, though (at least in the semi-random pattern that HOOMD accesses memory with). Issue one is the 128-byte L1 cache line size. If a warp's random reads intersect 5+ 128-byte cache lines, you're wasting a lot of L2->L1 bandwidth. It has also been documented on these forums via microbenchmarking that the cost of an L1 read increases linearly with the number of cache lines that the warp's reads intersect, even when every single read is a cache hit. Even assuming that the tex cache behaves the same way, its smaller line size is a huge advantage.
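To make the access pattern concrete, here is a hypothetical neighbor-list gather of the kind MD codes do (again illustrative names, not HOOMD's real force kernel). Each thread chases effectively random indices, so each 16-byte float4 miss through L1 drags in a full 128-byte line, an 8x overfetch, while the texture cache's smaller lines waste much less:

```cuda
// Positions bound to a texture reference for the semi-random reads.
texture<float4, 1, cudaReadModeElementType> pos_tex;

__global__ void gather_forces(const int *nbr, const int *n_nbr,
                              float4 *force, int pitch, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 acc = make_float4(0.f, 0.f, 0.f, 0.f);
    for (int k = 0; k < n_nbr[i]; ++k) {
        int j = nbr[k * pitch + i];        // semi-random neighbor index
        // A plain pos[j] here would pull a 128-byte L1 line per scattered
        // 16-byte read; the tex cache path wastes far less line bandwidth.
        float4 p = tex1Dfetch(pos_tex, j);
        acc.x += p.x; acc.y += p.y; acc.z += p.z;
    }
    force[i] = acc;
}
```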