Question about using texture/shared memory to improve computing efficiency

I am new to CUDA. After reading the programming guide 1.0, I am confused about texture memory and shared memory. The document claims that using shared memory can greatly improve computing efficiency, and it also mentions that reading device memory through texture fetching can be an advantageous alternative to reading device memory from global or constant memory.

My question is: which one is better? Can I combine them to get better performance? So far I have referred to the sample code provided by NVIDIA and successfully run a median filter that reads device memory through texture fetching. I just wonder if there is any way to get higher performance.

Thank you for your reply.

Shared memory can be used as a very effective cache when a block of threads cooperates, because all of its threads can read and write the same fast shared memory pool. The values loaded into shared memory can come from constant, texture, or normal global memory: it doesn’t matter.
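To make the "shared memory as a cache" idea concrete, here is a minimal sketch of a 3-point average in which each block loads its slice of the input once and every thread then reads its neighbours from the shared tile. The kernel name `average3`, the block size, and the clamp-at-the-edges policy are my own assumptions for illustration, not something taken from the SDK.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block (assumed value)

// Hypothetical kernel: each block stages BLOCK elements plus a 1-element halo
// on each side into shared memory, then averages from the shared tile.
__global__ void average3(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + 2];

    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + 1;                        // index inside the tile

    // Out-of-range reads are clamped to the array edges (assumed policy).
    tile[l] = in[min(g, n - 1)];
    if (threadIdx.x == 0) {
        tile[0]         = in[max(g - 1, 0)];
        tile[BLOCK + 1] = in[min(g + BLOCK, n - 1)];
    }
    __syncthreads();  // the whole tile must be loaded before anyone reads it

    if (g < n)
        out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}

int main()
{
    const int n = 1000;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = float(i % 10);

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    average3<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare against a CPU reference that uses the same clamping.
    int bad = 0;
    for (int i = 0; i < n; ++i) {
        float left  = h_in[i > 0 ? i - 1 : 0];
        float right = h_in[i < n - 1 ? i + 1 : n - 1];
        float ref   = (left + h_in[i] + right) / 3.0f;
        if (fabsf(h_out[i] - ref) > 1e-5f) ++bad;
    }
    printf("mismatches: %d\n", bad);

    cudaFree(d_in);
    cudaFree(d_out);
    return bad != 0;
}
```

Each input element is fetched from global memory roughly once per block instead of three times, and the payoff grows with the amount of reuse per element.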

Texture access to global memory is useful when you simply cannot change your algorithm to get coalesced global memory reads. It is still of the utmost importance that all texture reads within a warp have good spatial locality in memory to get good performance, though.

One isn’t better than the other; they are different and complementary. Usually you can look at a given application and solidly conclude that one is more suitable than the other. There isn’t that much overlap, since shared memory is on-chip and instanced by the hardware on a per thread block basis, while texture aliases memory that was allocated by the application.

They can be combined for better performance. The sobelFilter application in the SDK uses texture to stage pixel data into shared memory to take advantage of data reuse. The pixel accesses from shared memory are much lower latency than they would be from texture, and shared memory can deliver more data per clock.
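As a rough illustration of that staging pattern (this is not the SDK's sobelFilter code), here is a sketch of a 3x3 box filter that fetches each block's pixels plus a one-pixel halo through texture and then filters out of shared memory. It assumes the texture object API (`cudaTextureObject_t`, `tex2D<float>`), which is newer than the programming guide 1.0 discussed in this thread; the kernel name `box3x3`, the tile size, and the test image are made up for the example.

```cuda
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // block is TILE x TILE threads (assumed value)

// Hypothetical kernel: stage through texture, then filter from shared memory.
__global__ void box3x3(cudaTextureObject_t tex, float *out, int w, int h)
{
    __shared__ float tile[TILE + 2][TILE + 2];  // block's pixels + 1-pixel halo

    int x0 = blockIdx.x * TILE - 1;  // image coordinates of tile[0][0]
    int y0 = blockIdx.y * TILE - 1;

    // Threads cooperatively fetch the tile; clamp addressing covers the borders.
    for (int ty = threadIdx.y; ty < TILE + 2; ty += TILE)
        for (int tx = threadIdx.x; tx < TILE + 2; tx += TILE)
            tile[ty][tx] = tex2D<float>(tex, x0 + tx + 0.5f, y0 + ty + 0.5f);
    __syncthreads();

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= w || y >= h) return;

    float sum = 0.0f;  // all nine reads now come from shared memory
    for (int dy = 0; dy < 3; ++dy)
        for (int dx = 0; dx < 3; ++dx)
            sum += tile[threadIdx.y + dy][threadIdx.x + dx];
    out[y * w + x] = sum / 9.0f;
}

int main()
{
    const int w = 64, h = 48;
    std::vector<float> src(w * h), dst(w * h);
    for (int i = 0; i < w * h; ++i) src[i] = float(i % 7);

    // Copy the image into a CUDA array and wrap it in a texture object.
    cudaChannelFormatDesc fmt = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &fmt, w, h);
    cudaMemcpy2DToArray(arr, 0, 0, src.data(), w * sizeof(float),
                        w * sizeof(float), h, cudaMemcpyHostToDevice);
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;
    cudaTextureDesc td = {};
    td.addressMode[0] = td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;  // no interpolation
    td.readMode = cudaReadModeElementType;
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    float *d_out;
    cudaMalloc(&d_out, w * h * sizeof(float));
    dim3 block(TILE, TILE), grid((w + TILE - 1) / TILE, (h + TILE - 1) / TILE);
    box3x3<<<grid, block>>>(tex, d_out, w, h);
    cudaMemcpy(dst.data(), d_out, w * h * sizeof(float), cudaMemcpyDeviceToHost);

    // CPU reference with the same clamp-to-edge behaviour.
    int bad = 0;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float sum = 0.0f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    int sx = std::min(std::max(x + dx, 0), w - 1);
                    int sy = std::min(std::max(y + dy, 0), h - 1);
                    sum += src[sy * w + sx];
                }
            if (fabsf(dst[y * w + x] - sum / 9.0f) > 1e-5f) ++bad;
        }
    printf("mismatches: %d\n", bad);

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(d_out);
    return bad != 0;
}
```

The texture unit handles the border clamping and the not-quite-coalesced halo fetches, while the nine reads per output pixel are served from shared memory.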

Occasionally, you might find applications (like lookup tables) where the tradeoffs aren’t quite as clear. When that happens, often one of the resources is otherwise underutilized by the kernel so you tend to migrate to the other. If either is suitable and you really don’t know which is better, you should make the determination empirically; but I’d expect that to be rare.
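If it does come down to measuring, a small timing harness built on CUDA events is usually enough. The sketch below times two hypothetical lookup-table kernels, one that stages the table in shared memory and one that reads it through a texture object; the kernel names, table size, and index pattern are assumptions for illustration, and the texture object API used here is newer than the 1.0 guide. The numbers depend heavily on the GPU and on how the indices are distributed, so treat it as a template rather than a verdict.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define LUT_SIZE 256  // assumed table size

// Variant A (hypothetical): each block copies the table into shared memory.
__global__ void lutShared(const float *lut, const unsigned char *idx,
                          float *out, int n)
{
    __shared__ float s[LUT_SIZE];
    for (int i = threadIdx.x; i < LUT_SIZE; i += blockDim.x) s[i] = lut[i];
    __syncthreads();
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g < n) out[g] = s[idx[g]];
}

// Variant B (hypothetical): each lookup goes through the texture cache.
__global__ void lutTexture(cudaTextureObject_t lut, const unsigned char *idx,
                           float *out, int n)
{
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g < n) out[g] = tex1Dfetch<float>(lut, idx[g]);
}

// Average time per launch in milliseconds, measured with CUDA events.
template <typename Launch>
float averageMs(Launch launch, int reps = 50)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    launch();  // warm-up launch, not timed
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

int main()
{
    const int n = 1 << 20;
    std::vector<float> h_lut(LUT_SIZE);
    std::vector<unsigned char> h_idx(n);
    for (int i = 0; i < LUT_SIZE; ++i) h_lut[i] = 0.5f * i;
    for (int i = 0; i < n; ++i) h_idx[i] = (unsigned char)((i * 131) % LUT_SIZE);

    float *d_lut, *d_out;
    unsigned char *d_idx;
    cudaMalloc(&d_lut, LUT_SIZE * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_idx, n);
    cudaMemcpy(d_lut, h_lut.data(), LUT_SIZE * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx, h_idx.data(), n, cudaMemcpyHostToDevice);

    // Bind the table (plain linear device memory) to a texture object.
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = d_lut;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = LUT_SIZE * sizeof(float);
    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    dim3 block(256), grid((n + 255) / 256);
    float a = averageMs([&] { lutShared<<<grid, block>>>(d_lut, d_idx, d_out, n); });
    float b = averageMs([&] { lutTexture<<<grid, block>>>(tex, d_idx, d_out, n); });
    printf("shared: %.3f ms  texture: %.3f ms\n", a, b);

    cudaDestroyTextureObject(tex);
    cudaFree(d_lut);
    cudaFree(d_out);
    cudaFree(d_idx);
    return 0;
}
```

Swap in your real kernels and a representative index pattern before drawing conclusions, since a lookup pattern with poor locality can change which variant comes out ahead.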

Thank you for your reply, it is clearer now.