Texture cache filled at first cache miss?

Hi everyone,

I'd like to force the texture cache to be filled starting from a specified location in linear texture memory.

This is because I randomly access texture locations that are spatially close (within an 8 KB area), but not necessarily in order from the beginning of the region to the end, so I can get a lot of cache misses if the reads happen to go from the end towards the beginning.

Do I just have to do a single read at the beginning of the memory region to get the following 8 KB (per multiprocessor) of texture memory cached?
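
To make the question concrete, here is a minimal sketch of the pattern I mean (the kernel, variable names and window size are invented for illustration, and whether the lone warming fetch really pulls in more than one cache line is exactly what I'm asking):

```
texture<float, 1, cudaReadModeElementType> texRef;  // bound to linear memory with cudaBindTexture()

__global__ void randomReads(const int *indices, float *out, int n, int windowStart)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Hoped-for "prefetch": touch the first texel of the 8 KB window.
    float warm = tex1Dfetch(texRef, windowStart);

    // Random accesses, spatially close but in no particular order.
    // (warm is folded in only so the compiler cannot drop the first fetch.)
    out[tid] = tex1Dfetch(texRef, windowStart + indices[tid]) + 0.0f * warm;
}
```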

Thanks

A typical cache stores data surrounding what you’ve recently fetched. For example, when you request one 32-bit word, it may also store a few words to the left of it and a few words to the right. A single fetch will not completely fill the cache with data.

What you’re describing is basically prefetching (or warming up) a cache with a lot of data that you expect to use later multiple times. It’s highly unlikely that the texture cache will work that way.

If you know up front how you’re going to access your data, it’s usually much better to store it in shared memory first and fetch it from there. If you’re already using shared memory for something else and really think warming up the cache can help your algorithm, you should just run some experiments. The texture cache is optimized for 2D accesses, so linear fetches may not give the expected results.
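
As a rough illustration of the shared-memory approach (sizes and names are just placeholders, not your code), each block could stage the 8 KB window with ordered, coalesced loads and then do the random lookups from shared memory:

```
#define TILE_FLOATS 2048  // 8 KB worth of floats (placeholder size)

__global__ void stagedReads(const float *src, const int *indices,
                            float *out, int n, int windowStart)
{
    __shared__ float tile[TILE_FLOATS];

    // Ordered, coalesced loads fill the tile; once the data is in shared
    // memory, the random access order no longer hurts bandwidth.
    for (int i = threadIdx.x; i < TILE_FLOATS; i += blockDim.x)
        tile[i] = src[windowStart + i];
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = tile[indices[tid]];
}
```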

Tom

Have you actually benchmarked this? I would be surprised if it made a difference. The texture cache doesn’t work quite the same way as a CPU cache, which performs better when reading forward because the prefetcher pulls in the next cache line.

I tested it in my app anyway to be sure. It runs 64,000 independent loops through a data array of 64,000 elements. Each loop reads about 100 elements from the array on average.

Time for one full iteration:

Completely random distribution of data accesses:

0.005549976 s ← sorted beginning to end

0.005540606 s ← sorted end to beginning

Data “sorted” so that accesses have spatial locality within warps:

0.003525679 s ← sorted beginning to end

0.003536046 s ← sorted end to beginning

The differences between going beginning-to-end and end-to-beginning are clearly within the noise. To give you a feel for the memory performance of my app: the randomly distributed data gives a memory read throughput of ~40 GB/s, and the data with good spatial locality gives ~60 GB/s.

Details on the cache are sketchy, but I have a few ideas to rationalize this behavior given the testing I’ve done. What I’ve found is that TEMPORAL locality, where thread i is likely to access data elements near each other IN TIME, matters hardly at all. Think about it and it will make sense: you have hundreds and hundreds of threads in flight on each multiprocessor. By the time one thread finishes using the data from its first memory read and goes on to the second, enough other threads have probably read memory on that multiprocessor to evict anything that was put into the cache for thread i.

What matters most for the texture cache is ensuring that each WARP accesses elements that are near each other, since the threads of a warp access memory simultaneously. Perhaps it also helps to ensure that warps that are nearby in the indexing scheme access elements near each other, but I’m not sure: the order of warp execution is undefined, so who knows what the memory access pattern will be in that case. I haven’t done any tests to corroborate this, though.
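
To make the warp-locality point concrete, here is the kind of indexing I mean (an illustrative sketch, not my benchmark code): consecutive lanes of a warp fetch consecutive texels, so each cache line brought in services the whole warp at once.

```
texture<float, 1, cudaReadModeElementType> texRef;

__global__ void warpLocalReads(float *out, int elemsPerThread)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = tid / 32;
    int lane   = tid % 32;

    float acc = 0.0f;
    for (int i = 0; i < elemsPerThread; ++i)
        // All 32 lanes of the warp read a contiguous 32-element chunk together.
        acc += tex1Dfetch(texRef, (warpId * elemsPerThread + i) * 32 + lane);

    out[tid] = acc;
}
```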

Hi, thanks. I agree with you about the forward and backward neighbourhood accesses; it was an initial guess and I hadn’t tested it, so thanks for doing it, now we are sure.

I also agree with you about ensuring locality within all the threads of a warp, but I’ll go further: I think one has to ensure locality across all the threads of a block if one wants to keep texture data in the cache longer, because the scheduler will activate the next warp as soon as the first one has issued its read, so the temporal locality between different warps is very high. The only way to do this is to use __syncthreads() after a read, to ensure all the threads within the block have done their reads. I think such a synchronous block-wide read is necessary because the cache is per MP.
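
Something like this is what I have in mind (just a sketch with invented names): every thread of the block issues its read for the current step, then the block synchronises before any warp moves on to the next region.

```
texture<float, 1, cudaReadModeElementType> texRef;

__global__ void lockstepReads(const int *indices, float *out, int stepsPerThread)
{
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;

    for (int step = 0; step < stepsPerThread; ++step) {
        // Every thread of the block issues its read for this step...
        acc += tex1Dfetch(texRef, indices[tid * stepsPerThread + step]);
        // ...then the whole block waits, so all warps work on the same cached
        // region before any warp races ahead to a different one.
        __syncthreads();
    }
    out[tid] = acc;
}
```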

The problem is that, if we push this further, we should synchronise all the threads on the same MP, and in some cases (if the number of blocks is greater than 16) we would have to synchronise blocks on the same MP, which is currently impossible in CUDA before the kernel terminates. Nevertheless, since there is more temporal locality between warps in the same block than between warps of different blocks on the same MP, perhaps synchronisation within a block alone could be sufficient. It has to be tested; I’ll try this.

If any developers from NVIDIA who know the texture cache behaviour, fill patterns, etc. (and why not the constant cache as well) could share their explanations, their help would be very welcome and appreciated =).