Block-based model for texture caching in CUDA

I’m a little confused about the caching mechanism on the GPU. The paper “Cache-Efficient Numerical Algorithms using Graphics Hardware” presents a couple of cache-efficient algorithms on NVIDIA hardware:

I need some clarification on the ideas presented in the paper, and I would appreciate it if we could discuss them in this forum:

Is this something that is done by the underlying caching mechanism in CUDA? How can the block size be specified?

In general, the sorting algorithm presented in the paper renders row-aligned quads of height h and width W, and a cache-efficient algorithm is presented that maximizes cache utilization for given block and cache sizes, quoted below:

The block-based idea was discussed earlier in the paper, in the context of fetching BxB blocks (my first quote). I don’t understand how the following idea differs from that first one, other than specifying a different block size for each fetch from device memory. Am I missing the main point here?

How is it possible to change the block size? Is it something that can be controlled by software in CUDA?

No, it’s not possible to change any aspect of the texture cache behaviour from CUDA.

We don’t usually disclose many details of the texture cache, mainly because it will likely change in future hardware.

The most important points about the cache are: it is 2D, it is very small (effectively about 6-8 KB per multiprocessor), and you benefit when all the threads in a warp access nearby locations in the texture (as demonstrated in that paper).

Don’t overlook this one. It is the most important!
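
To make the warp-locality point concrete, here is a minimal sketch of a kernel in which each 16x16 thread block reads a compact 16x16 tile of a 2D texture, so the 32 threads of a warp touch two adjacent 16-texel rows. It assumes the texture object API (cudaTextureObject_t), which is newer than this thread; the kernel name copyFromTexture, the 256x256 size, and the 16x16 block shape are illustrative choices, not values taken from the paper or from the hardware.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread fetches one texel. With a 16x16 block,
// a warp (32 consecutive threads) covers two adjacent rows of 16 texels,
// which keeps its accesses close together in 2D.
__global__ void copyFromTexture(cudaTextureObject_t tex, float* out,
                                int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Unnormalized coordinates; +0.5f addresses the texel centre.
        out[y * width + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
    }
}

int main()
{
    const int width = 256, height = 256;   // illustrative size
    std::vector<float> host(width * height);
    for (int i = 0; i < width * height; ++i) host[i] = static_cast<float>(i);

    // A CUDA array is the storage that 2D texture fetches read from here.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t array;
    cudaMallocArray(&array, &desc, width, height);
    cudaMemcpy2DToArray(array, 0, 0, host.data(), width * sizeof(float),
                        width * sizeof(float), height, cudaMemcpyHostToDevice);

    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = array;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode = cudaFilterModePoint;       // no interpolation
    texDesc.readMode = cudaReadModeElementType;
    texDesc.normalizedCoords = 0;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

    float* dOut;
    cudaMalloc(&dOut, width * height * sizeof(float));

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    copyFromTexture<<<grid, block>>>(tex, dOut, width, height);

    std::vector<float> result(width * height);
    cudaError_t err = cudaMemcpy(result.data(), dOut,
                                 width * height * sizeof(float),
                                 cudaMemcpyDeviceToHost);

    // Verify that every texel came back unchanged.
    int mismatches = 0;
    for (int i = 0; i < width * height; ++i)
        if (result[i] != host[i]) ++mismatches;
    printf("status: %s, mismatches: %d\n", cudaGetErrorString(err), mismatches);

    cudaDestroyTextureObject(tex);
    cudaFreeArray(array);
    cudaFree(dOut);
    return (err == cudaSuccess && mismatches == 0) ? 0 : 1;
}
```

The only thing software controls here is which texels each warp touches; the cache itself stays opaque. Swapping the block shape for something like dim3 block(256, 1) would spread a warp along a single row instead, which is an easy variation to time against the tiled version.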

Another tip is to pack data up if you can. A single float4 texture read is faster than 4 separate float texture reads.


Do you have a figure for how much faster reading float4 is compared with reading float?

Would it still be faster even if I do read neighbouring cells?



That depends entirely on the application. It has been so long since I did the comparison in my app that I don’t remember the exact numbers. It was at least 50% faster, probably more. You’ll just have to benchmark your app both ways and find out.
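
As a starting point for that kind of benchmark, here is a rough sketch that times one float4 fetch per thread against four separate float fetches of the same data, using CUDA events. It assumes the texture object API and textures bound to linear memory via tex1Dfetch; the kernel names (sumPacked, sumScalar), the helper names, the buffer size, and the repetition count are all made up for illustration. It prints whatever the hardware measures and says nothing about what the ratio should be.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: one packed float4 fetch per thread.
__global__ void sumPacked(cudaTextureObject_t tex, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = tex1Dfetch<float4>(tex, i);
        out[i] = v.x + v.y + v.z + v.w;
    }
}

// Hypothetical kernel: the same four values read as separate float fetches.
__global__ void sumScalar(cudaTextureObject_t tex, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = tex1Dfetch<float>(tex, 4 * i)
               + tex1Dfetch<float>(tex, 4 * i + 1)
               + tex1Dfetch<float>(tex, 4 * i + 2)
               + tex1Dfetch<float>(tex, 4 * i + 3);
    }
}

typedef void (*Kernel)(cudaTextureObject_t, float*, int);

// Wrap a linear device buffer in a texture object with the given format.
static cudaTextureObject_t makeLinearTexture(void* devPtr,
                                             cudaChannelFormatDesc desc,
                                             size_t bytes)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = devPtr;
    resDesc.res.linear.desc = desc;
    resDesc.res.linear.sizeInBytes = bytes;

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
    return tex;
}

// Average time per launch in milliseconds, measured with CUDA events.
static float timeKernel(Kernel kernel, cudaTextureObject_t tex,
                        float* out, int n)
{
    const int threads = 256, blocks = (n + threads - 1) / threads, reps = 20;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    kernel<<<blocks, threads>>>(tex, out, n);   // warm-up launch, not timed
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        kernel<<<blocks, threads>>>(tex, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

int main()
{
    const int n = 1 << 20;                      // number of float4 elements
    const size_t bytes = n * sizeof(float4);
    float4* dData;
    float* dOut;
    cudaMalloc(&dData, bytes);
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemset(dData, 0, bytes);                // contents do not affect timing

    // Two views of the same buffer: n float4 texels or 4n float texels.
    cudaTextureObject_t tex4 =
        makeLinearTexture(dData, cudaCreateChannelDesc<float4>(), bytes);
    cudaTextureObject_t tex1 =
        makeLinearTexture(dData, cudaCreateChannelDesc<float>(), bytes);

    printf("float4 fetch:   %.4f ms per launch\n",
           timeKernel(sumPacked, tex4, dOut, n));
    printf("4x float fetch: %.4f ms per launch\n",
           timeKernel(sumScalar, tex1, dOut, n));

    cudaError_t err = cudaDeviceSynchronize();
    cudaDestroyTextureObject(tex4);
    cudaDestroyTextureObject(tex1);
    cudaFree(dData);
    cudaFree(dOut);
    return err == cudaSuccess ? 0 : 1;
}
```

This is a synthetic access pattern in which the four floats are adjacent, which is the "neighbouring cells" case asked about above. A real application's access pattern may behave quite differently, so treat this as a template to adapt rather than a result.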