Why is tex1Dfetch 10-15 times faster than global memory?
Because the data is being cached and reused by multiple threads possibly across multiple blocks.
If you are running on a Fermi GPU, I would expect the much larger L1 cache to do the same job for you automatically. What is your GPU?
Texture data is cached on chip, so it is closer to the cores. But that advantage mostly applies to cards with compute capability below 2.0, where ordinary global memory loads are not cached at all. Fermi has L1 and L2 caches, so in many situations it can be faster not to use textures, although they can still help with non-coalesced accesses.
Do you know of any good comparisons for CC 2.x that specifically benchmark the texture cache against the L1 and L2 caches?
The texture cache could potentially be much faster for interpolation, but I wonder if it might also be more efficient when all blocks are reusing a small amount of data extensively (<< 8 KB).
I think it is mentioned in the Fermi tuning guide. They compare the speed of the texture cache to the speed of L1, so it is more about the theoretical speed.
Yeah, they don't really go as in-depth on the subject as one would like:
Fermi tuning guide:
It depends a lot on the problem. Sometimes, when a lot of data is cached, using textures can still improve performance by taking some of the load off the L1 cache.
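Since it depends so much on the problem, the most reliable answer is probably to time both paths on your own card. Below is a minimal sketch of such a micro-benchmark, not a definitive test. The assumptions are mine, not from this thread: a 4 KB lookup table reused by every block, a strided (non-coalesced) index pattern, and the kernel names `gatherGlobal` and `gatherTexture`, which are made up for illustration. It uses the texture object API, which needs compute capability 3.0 or higher; on the CC 1.x and 2.x cards discussed here you would use the older texture reference API (`cudaBindTexture`) instead. Error checking is omitted to keep it short.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Gather through ordinary global memory loads.
__global__ void gatherGlobal(const float* table, const int* idx, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = table[idx[i]];
}

// Same gather, but the table reads go through the texture path.
__global__ void gatherTexture(cudaTextureObject_t tex, const int* idx, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch<float>(tex, idx[i]);
}

int main()
{
    const int n = 1 << 22;       // number of lookups (assumed size)
    const int tableSize = 1024;  // 1024 floats = 4 KB, reused by every block

    std::vector<float> hTable(tableSize);
    std::vector<int> hIdx(n);
    for (int i = 0; i < tableSize; ++i) hTable[i] = 0.5f * i;
    // Stride of 37 floats: neighbouring threads hit different 128-byte lines.
    for (int i = 0; i < n; ++i) hIdx[i] = (i * 37) % tableSize;

    float *dTable, *dOutG, *dOutT;
    int* dIdx;
    cudaMalloc(&dTable, tableSize * sizeof(float));
    cudaMalloc(&dIdx, n * sizeof(int));
    cudaMalloc(&dOutG, n * sizeof(float));
    cudaMalloc(&dOutT, n * sizeof(float));
    cudaMemcpy(dTable, hTable.data(), tableSize * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dIdx, hIdx.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Describe the linear device buffer as a 1D texture of floats.
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = dTable;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = tableSize * sizeof(float);
    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    cudaEventRecord(t0);
    gatherGlobal<<<blocks, threads>>>(dTable, dIdx, dOutG, n);
    cudaEventRecord(t1);
    gatherTexture<<<blocks, threads>>>(tex, dIdx, dOutT, n);
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float msGlobal = 0.0f, msTexture = 0.0f;
    cudaEventElapsedTime(&msGlobal, t0, t1);
    cudaEventElapsedTime(&msTexture, t1, t2);

    // Both kernels read the same values, so the outputs should match exactly.
    std::vector<float> hOutG(n), hOutT(n);
    cudaMemcpy(hOutG.data(), dOutG, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(hOutT.data(), dOutT, n * sizeof(float), cudaMemcpyDeviceToHost);
    int mismatches = 0;
    for (int i = 0; i < n; ++i) mismatches += (hOutG[i] != hOutT[i]);

    printf("global: %.3f ms, texture: %.3f ms, mismatches: %d\n",
           msGlobal, msTexture, mismatches);

    cudaDestroyTextureObject(tex);
    cudaFree(dTable); cudaFree(dIdx); cudaFree(dOutG); cudaFree(dOutT);
    return mismatches != 0;
}
```

A single timed launch like this is noisy, and the first kernel also pays any one-time start-up cost. For numbers you can trust, run each kernel a few times first and average over many launches. Changing `tableSize` and the stride should show where (or whether) the texture path pulls ahead on your hardware.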