Have you actually benchmarked this? I would be surprised if it made a difference. The texture cache doesn’t work quite the same way as a CPU cache, which favors forward traversal because the hardware prefetches the next cache line.
I tested it in my app anyway to be sure. It runs 64,000 independent loops over a data array of 64,000 elements, with each loop reading about 100 elements from the array on average.
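Here is a minimal sketch of the kind of kernel being timed (the names, the MAX_READS stride, and the index-table layout are illustrative assumptions, not my actual app code; it uses the classic texture-reference API):

#include <cuda_runtime.h>

#define MAX_READS 128  // per-loop index-list stride (an assumption; reads average ~100)

// Bound on the host to the 64,000-element float array with cudaBindTexture().
texture<float, 1, cudaReadModeElementType> dataTex;

__global__ void benchLoops(const int* indices, const int* counts,
                           float* out, int nLoops)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one independent loop per thread
    if (i >= nLoops) return;

    // Each loop walks its own list of ~100 indices into the data array.
    // Sorting those index lists is what creates (or destroys) spatial
    // locality within a warp.
    float sum = 0.0f;
    for (int j = 0; j < counts[i]; ++j)
        sum += tex1Dfetch(dataTex, indices[i * MAX_READS + j]);

    out[i] = sum;  // write a result so the reads aren't optimized away
}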
Time for one full iteration:
Completely random distribution of data accesses:
0.005549976 s <-- sorted beginning to end
0.005540606 s <-- sorted end to beginning
“Sorted” data, so that accesses have spatial locality within warps:
0.003525679 s <-- sorted beginning to end
0.003536046 s <-- sorted end to beginning
The differences between going beginning to end and end to beginning are clearly within the noise. To give you a feel for my app’s memory performance: the randomly distributed accesses read at ~40 GB/s, and the data with good spatial locality reads at ~60 GB/s.
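Those GB/s figures are just effective bandwidth: total bytes actually fetched divided by elapsed time. A one-line sketch with placeholder names (what counts toward totalBytesRead, e.g. data elements versus index tables, depends on the app):

double effectiveGBs = (double)totalBytesRead / elapsedSeconds / 1.0e9;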
Details on the cache are sketchy, but I have a few ideas to rationalize this behavior given the testing I’ve done. What I’ve found is that TEMPORAL locality, where thread i accesses data elements near each other IN TIME, matters almost not at all. Think about it and it makes sense: you have hundreds and hundreds of threads in flight on each multiprocessor. By the time one thread finishes with the data from its first memory read and moves on to its second, enough other threads have read memory on that multiprocessor to evict whatever the cache was holding for thread i.
What matters most for the texture cache is ensuring that each WARP accesses elements near each other, since the threads of a warp read memory simultaneously. Perhaps it also helps if warps that are adjacent in the indexing scheme access nearby elements, but I’m not sure: the order of warp execution is undefined, so who knows what the memory access pattern would be in that case. I haven’t done any tests to corroborate that part, though.
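To make the within-warp pattern concrete, here is an illustrative pair of kernels (reusing the dataTex reference from the sketch above; the kernel names and the perm/warpBase tables are inventions for this example, not from my app):

// Poor locality: each thread fetches an arbitrary scattered index, so a
// warp's 32 simultaneous reads land all over the array.
__global__ void scatteredReads(const int* perm, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = tex1Dfetch(dataTex, perm[i]);
}

// Good locality: all 32 threads of a warp read consecutive elements from a
// shared base offset, so their simultaneous fetches hit one small region of
// the array. (Assumes blockDim.x is a multiple of 32.)
__global__ void warpLocalReads(const int* warpBase, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;  // position within the warp
    out[i] = tex1Dfetch(dataTex, warpBase[i >> 5] + lane);
}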