Relative speed of tex1Dfetch() vs. buffering in registers

I have a texture of 20 floats (a 4 by 5 array). Since it is so small, I expect all of it to
be resident in cache. (However, there are much bigger textures as well.)
I use the values it contains about 16 times per thread. Any ideas on when it is
worth buffering by reading a value into a local variable and using that, rather
than calling tex1Dfetch() every time?
The texture is treated as a 2D array, which involves some index calculation.

Any help or comments would be most welcome.
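
To make the two options in the question concrete, here is a minimal sketch, not the poster's actual code. It assumes each thread works on one row of the 4 by 5 table and touches each of its 5 values 3 times (15 uses, close to the 16 mentioned). It also assumes the texture object overload of tex1Dfetch() rather than the older texture reference API. The names fetchEveryTime, fetchOnceThenReuse, maxVariantDifference and the row array are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cmath>

#define ROWS   4
#define COLS   5
#define PASSES 3   // assumption: 5 values x 3 passes = 15 uses per thread

// Variant A: call tex1Dfetch() on every use.
__global__ void fetchEveryTime(cudaTextureObject_t tex, const int *row, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    const int base = row[tid] * COLS;          // 2D (row, col) flattened to a 1D index
    float acc = 0.0f;
    for (int pass = 0; pass < PASSES; ++pass)
        for (int c = 0; c < COLS; ++c)
            acc += tex1Dfetch<float>(tex, base + c) * (pass + 1);
    out[tid] = acc;
}

// Variant B: fetch each value once into a local buffer, then reuse it.
__global__ void fetchOnceThenReuse(cudaTextureObject_t tex, const int *row, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    const int base = row[tid] * COLS;
    float buf[COLS];
    // Fully unrolled loops index buf with compile-time constants only,
    // which normally lets the compiler keep it in registers.
    #pragma unroll
    for (int c = 0; c < COLS; ++c)
        buf[c] = tex1Dfetch<float>(tex, base + c);
    float acc = 0.0f;
    #pragma unroll
    for (int pass = 0; pass < PASSES; ++pass) {
        #pragma unroll
        for (int c = 0; c < COLS; ++c)
            acc += buf[c] * (pass + 1);
    }
    out[tid] = acc;
}

// Hypothetical host helper: runs both variants on the same input and
// returns the largest absolute difference between their outputs.
float maxVariantDifference()
{
    const int n = 128;
    float hTable[ROWS * COLS];
    int hRow[n];
    for (int i = 0; i < ROWS * COLS; ++i) hTable[i] = 0.5f * i;
    for (int i = 0; i < n; ++i) hRow[i] = i % ROWS;

    float *dTable, *dA, *dB;
    int *dRow;
    cudaMalloc(&dTable, sizeof(hTable));
    cudaMalloc(&dRow, sizeof(hRow));
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));
    cudaMemcpy(dTable, hTable, sizeof(hTable), cudaMemcpyHostToDevice);
    cudaMemcpy(dRow, hRow, sizeof(hRow), cudaMemcpyHostToDevice);

    // Wrap the linear device buffer in a texture object.
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = dTable;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = sizeof(hTable);
    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    fetchEveryTime<<<1, n>>>(tex, dRow, dA, n);
    fetchOnceThenReuse<<<1, n>>>(tex, dRow, dB, n);

    float hA[n], hB[n];
    cudaMemcpy(hA, dA, sizeof(hA), cudaMemcpyDeviceToHost);
    cudaMemcpy(hB, dB, sizeof(hB), cudaMemcpyDeviceToHost);
    float worst = 0.0f;
    for (int i = 0; i < n; ++i) worst = fmaxf(worst, fabsf(hA[i] - hB[i]));

    cudaDestroyTextureObject(tex);
    cudaFree(dTable); cudaFree(dRow); cudaFree(dA); cudaFree(dB);
    return worst;
}
```

One caveat with the buffered variant: if the local array ends up being indexed with a value only known at run time, the compiler may place it in local memory instead of registers, which could cancel the benefit.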

If you’re on a Kepler device, you could load the float[20] into a single register per lane (one element in each of the first 20 lanes of the warp) and then use a SHFL instruction to randomly access those 20 lanes.

Dear Allan,
Thanks for this suggestion.
Although adjacent threads will tend to use adjacent elements of the array, this is not
guaranteed. At present the code does not require any communication between the threads:
it assumes each thread can randomly read the float it wants, regardless
of the other threads. Does SHFL force threads to cooperate?

The “randomly accessed small array” idiom allows all lanes in a warp to fetch any element from a small array held in a single register per lane. SHFL is a warp operation and executes with low latency, reasonable throughput and no warp divergence.
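
A rough sketch of that idiom follows, under a few assumptions: it uses the `__shfl_sync()` intrinsic from CUDA 9 and later (on Kepler-era toolkits the intrinsic was `__shfl()`), blocks are one-dimensional with a size that is a multiple of 32 so every warp is full, and the names loadTableIntoLane, tableLookup, lookupKernel and countShuffleMismatches are made up for illustration. Every lane takes part in the shuffle instruction, but each lane chooses its own source lane independently.

```cuda
#include <cuda_runtime.h>

#define TABLE_SIZE 20
#define FULL_MASK  0xffffffffu

// Lane i of each warp keeps table[i] in one register (lanes 20..31 hold 0).
__device__ float loadTableIntoLane(const float *table)
{
    int lane = threadIdx.x & 31;
    return (lane < TABLE_SIZE) ? table[lane] : 0.0f;
}

// Returns the value that lane `idx` holds, i.e. table[idx] for idx < 20.
// Each calling lane may pass a different idx.
__device__ float tableLookup(float held, int idx)
{
    return __shfl_sync(FULL_MASK, held, idx);
}

__global__ void lookupKernel(const float *table, const int *indices, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float held = loadTableIntoLane(table);
    // All 32 lanes must execute the shuffle, so out-of-range threads use a
    // dummy index instead of returning early.
    int idx = (tid < n) ? indices[tid] : 0;
    float v = tableLookup(held, idx);
    if (tid < n) out[tid] = v;
}

// Hypothetical host helper: compares the shuffle lookup with a plain array
// read on the host and returns the number of mismatches.
int countShuffleMismatches()
{
    const int n = 64;
    float hTable[TABLE_SIZE], hOut[n];
    int hIdx[n];
    for (int i = 0; i < TABLE_SIZE; ++i) hTable[i] = 10.0f * i;
    for (int i = 0; i < n; ++i) hIdx[i] = (i * 7) % TABLE_SIZE;  // scattered indices

    float *dTable, *dOut;
    int *dIdx;
    cudaMalloc(&dTable, sizeof(hTable));
    cudaMalloc(&dIdx, sizeof(hIdx));
    cudaMalloc(&dOut, sizeof(hOut));
    cudaMemcpy(dTable, hTable, sizeof(hTable), cudaMemcpyHostToDevice);
    cudaMemcpy(dIdx, hIdx, sizeof(hIdx), cudaMemcpyHostToDevice);

    lookupKernel<<<n / 32, 32>>>(dTable, dIdx, dOut, n);
    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hOut[i] != hTable[hIdx[i]]) ++mismatches;

    cudaFree(dTable); cudaFree(dIdx); cudaFree(dOut);
    return mismatches;
}
```

On a device of compute capability 3.0 or later, countShuffleMismatches() should return 0. The per-thread logic stays independent: the only requirement is that all lanes of the warp reach the shuffle together.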

Benchmarking is the only way to be sure.
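
One common way to do that is to time each variant with CUDA events. The sketch below is only a harness outline: kernelUnderTest is a placeholder kernel and averageKernelMs is a hypothetical helper, so substitute the texture, buffered or shuffle kernel being compared.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical); replace with the variant being measured.
__global__ void kernelUnderTest(float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = 2.0f * tid;
}

// Returns the average time per launch in milliseconds over `repeats` launches.
float averageKernelMs(int n, int repeats)
{
    float *dOut;
    cudaMalloc(&dOut, n * sizeof(float));
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time start-up costs are not included in the timing.
    kernelUnderTest<<<blocks, threads>>>(dOut, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < repeats; ++i)
        kernelUnderTest<<<blocks, threads>>>(dOut, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until all timed launches have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dOut);
    return ms / repeats;
}
```

Running each variant through the same harness with realistic access patterns and problem sizes gives numbers that can be compared directly.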