Utilizing texture memory: how can the texture cache be used more effectively?

How can we better utilize texture memory to accelerate CUDA programs? Which aspects of texture memory are worth studying further? How can the texture cache be used more effectively?

I tested a simple image blending program that I wrote myself, implemented once with texture memory and once with shared memory, and compared the execution times. The texture version was 3 times faster than the shared memory version. That surprised me, because I had always thought shared memory was faster.
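To make the comparison concrete, here is a minimal sketch of what the two variants might look like. It is not the actual program from the test above: the kernel names `blendTex` and `blendShared`, the helpers `makeTexture` and `runBlendCheck`, the 16x16 tile, the single-channel float images, and the blend weight are all assumptions chosen for illustration. It only checks that both kernels give the same result; it does not measure time.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // assumed block size, not taken from the original program

// Variant 1: read both images through texture objects.
__global__ void blendTex(cudaTextureObject_t a, cudaTextureObject_t b,
                         float* out, int w, int h, float alpha) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    // +0.5f addresses the texel center in unnormalized coordinates.
    float pa = tex2D<float>(a, x + 0.5f, y + 0.5f);
    float pb = tex2D<float>(b, x + 0.5f, y + 0.5f);
    out[y * w + x] = alpha * pa + (1.0f - alpha) * pb;
}

// Variant 2: stage each pixel in shared memory before blending.
__global__ void blendShared(const float* a, const float* b,
                            float* out, int w, int h, float alpha) {
    __shared__ float sa[TILE][TILE];
    __shared__ float sb[TILE][TILE];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    bool inside = (x < w && y < h);
    if (inside) {
        sa[threadIdx.y][threadIdx.x] = a[y * w + x];
        sb[threadIdx.y][threadIdx.x] = b[y * w + x];
    }
    __syncthreads();  // every thread in the block reaches this barrier
    if (inside)
        out[y * w + x] = alpha * sa[threadIdx.y][threadIdx.x] +
                         (1.0f - alpha) * sb[threadIdx.y][threadIdx.x];
}

// Hypothetical helper: copy a host image into a CUDA array and wrap it
// in a texture object with point sampling and clamped addressing.
static cudaTextureObject_t makeTexture(const std::vector<float>& img,
                                       int w, int h, cudaArray_t* arr) {
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaMallocArray(arr, &desc, w, h);
    cudaMemcpy2DToArray(*arr, 0, 0, img.data(), w * sizeof(float),
                        w * sizeof(float), h, cudaMemcpyHostToDevice);
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = *arr;
    cudaTextureDesc tex = {};
    tex.addressMode[0] = cudaAddressModeClamp;
    tex.addressMode[1] = cudaAddressModeClamp;
    tex.filterMode = cudaFilterModePoint;
    tex.readMode = cudaReadModeElementType;
    tex.normalizedCoords = 0;
    cudaTextureObject_t obj = 0;
    cudaCreateTextureObject(&obj, &res, &tex, nullptr);
    return obj;
}

// Runs both variants on synthetic data and compares them to a CPU result.
bool runBlendCheck() {
    const int w = 64, h = 48;
    const float alpha = 0.25f;
    const size_t bytes = w * h * sizeof(float);
    std::vector<float> a(w * h), b(w * h), outTex(w * h), outSh(w * h);
    for (int i = 0; i < w * h; ++i) { a[i] = float(i % 7); b[i] = float(i % 5); }

    cudaArray_t arrA, arrB;
    cudaTextureObject_t texA = makeTexture(a, w, h, &arrA);
    cudaTextureObject_t texB = makeTexture(b, w, h, &arrB);
    float *dA, *dB, *dOut;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dOut, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE), grid((w + TILE - 1) / TILE, (h + TILE - 1) / TILE);
    blendTex<<<grid, block>>>(texA, texB, dOut, w, h, alpha);
    cudaMemcpy(outTex.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    blendShared<<<grid, block>>>(dA, dB, dOut, w, h, alpha);
    cudaMemcpy(outSh.data(), dOut, bytes, cudaMemcpyDeviceToHost);

    bool ok = (cudaGetLastError() == cudaSuccess);
    for (int i = 0; i < w * h; ++i) {
        float expected = alpha * a[i] + (1.0f - alpha) * b[i];
        ok = ok && std::fabs(outTex[i] - expected) < 1e-5f
                && std::fabs(outSh[i] - expected) < 1e-5f;
    }
    cudaDestroyTextureObject(texA); cudaDestroyTextureObject(texB);
    cudaFreeArray(arrA); cudaFreeArray(arrB);
    cudaFree(dA); cudaFree(dB); cudaFree(dOut);
    return ok;
}

int main() {
    std::printf("%s\n", runBlendCheck() ? "both kernels match the CPU result"
                                        : "mismatch or CUDA error");
    return 0;
}
```

If the real program resembles this sketch, one possible explanation for the result is that blending reads each pixel exactly once, so the shared memory version adds a store, a barrier, and a load without any data reuse to pay for them. That is only a guess about this case. To time the two kernels, `cudaEventRecord` and `cudaEventElapsedTime` around each launch would be one way to do it.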

Let’s start a discussion!

Thanks for participating!