CUDA texture memory performance

First, as I understand it, texture memory is cached. So why does the CUDA user manual say that shared memory is faster than texture memory?

Second, the texture cache is only about 8 KB per multiprocessor, so the total texture cache is about 80 KB, yet a texture image can be as large as 2^16 x 2^16. How much space does one texel occupy? The user manual says that a texture fetch can miss the cache, so I assume the texture itself is stored in device memory and only part of it sits in the cache at any time. Is that right?
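To put rough numbers on the texel-size question, here is a minimal host-side sketch. It assumes the texel size is simply the sum of the channel bit widths in the texture's `cudaChannelFormatDesc`, and it reuses the 8 KB per-multiprocessor cache figure and the 2^16 x 2^16 image size quoted above. The helper name `texelBytes` is made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: bytes per texel = sum of per-channel bit widths / 8.
static unsigned long long texelBytes(const cudaChannelFormatDesc& d) {
    return (unsigned long long)(d.x + d.y + d.z + d.w) / 8;
}

int main() {
    const unsigned long long texels = (1ull << 16) * (1ull << 16);  // 2^16 x 2^16 image
    const unsigned long long cacheBytes = 8 * 1024;  // per-multiprocessor figure from the post

    const cudaChannelFormatDesc formats[] = {
        cudaCreateChannelDesc<unsigned char>(),
        cudaCreateChannelDesc<float>(),
        cudaCreateChannelDesc<float4>(),
    };
    const char* names[] = {"uchar", "float", "float4"};

    for (int i = 0; i < 3; ++i) {
        unsigned long long b = texelBytes(formats[i]);
        printf("%-6s: %2llu B/texel, full image %2llu GiB, %4llu texels fit in 8 KB\n",
               names[i], b, (texels * b) >> 30, cacheBytes / b);
    }
    return 0;
}
```

For a single-channel `float` texture this works out to 4 bytes per texel, so an 8 KB cache holds about 2048 texels while a full 2^16 x 2^16 image would need 16 GiB. The cache can therefore only ever hold a small working set of a large texture.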

Third, a question for anyone familiar with the convolution_Texture example in the CUDA SDK. Its performance is much lower than IPP (Intel Integrated Performance Primitives). Can anyone give me some advice on speeding up the convolution algorithm under CUDA? I know the separable approach speeds things up a lot, but the transfer time is too large: the result of the row pass has to be written to device memory, then delivered to a texture again before the column convolution can be calculated. The user manual says that texture data must come from device memory. Does anybody have advice on reducing this transfer cost, e.g. with a PBO? (It is not clear to me whether a PBO would improve the data transfer.)
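One possible way to keep the intermediate row result on the device is sketched below. It is only a sketch under some assumptions: it uses texture objects (`cudaCreateTextureObject`), which need a newer toolkit and GPU than the SDK sample was written for, and it wraps pitched device buffers directly as textures so no extra copy sits between the two passes. The names `convolve1D`, `wrapAsTexture`, `separableConvolve`, and the 3-tap filter are made up for illustration and are not taken from the SDK sample.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e)); exit(1); } } while (0)

#define RADIUS 1
__constant__ float c_taps[2 * RADIUS + 1];  // 1D filter taps, shared by both passes

// One 1D pass: (dx, dy) = (1, 0) filters along rows, (0, 1) along columns.
// Input is read through the texture path, output is written to plain device memory.
__global__ void convolve1D(cudaTextureObject_t src, float* dst, size_t dstPitchElems,
                           int w, int h, int dx, int dy) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int k = -RADIUS; k <= RADIUS; ++k)  // +0.5f addresses the texel centre
        sum += c_taps[k + RADIUS] * tex2D<float>(src, x + k * dx + 0.5f, y + k * dy + 0.5f);
    dst[y * dstPitchElems + x] = sum;
}

// Wrap an existing pitched device buffer in a texture object; no pixel data is copied.
static cudaTextureObject_t wrapAsTexture(float* devPtr, size_t pitchBytes, int w, int h) {
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypePitch2D;
    res.res.pitch2D.devPtr = devPtr;
    res.res.pitch2D.desc = cudaCreateChannelDesc<float>();
    res.res.pitch2D.width = w;
    res.res.pitch2D.height = h;
    res.res.pitch2D.pitchInBytes = pitchBytes;
    cudaTextureDesc tex = {};
    tex.addressMode[0] = tex.addressMode[1] = cudaAddressModeClamp;  // clamp at borders
    tex.filterMode = cudaFilterModePoint;                            // no interpolation
    tex.readMode = cudaReadModeElementType;
    tex.normalizedCoords = 0;
    cudaTextureObject_t obj = 0;
    CHECK(cudaCreateTextureObject(&obj, &res, &tex, nullptr));
    return obj;
}

// Hypothetical helper: host image in, host image out, two host transfers in total.
std::vector<float> separableConvolve(const std::vector<float>& in, int w, int h) {
    const float taps[2 * RADIUS + 1] = {0.25f, 0.5f, 0.25f};  // illustrative symmetric filter
    CHECK(cudaMemcpyToSymbol(c_taps, taps, sizeof(taps)));
    const size_t rowBytes = w * sizeof(float);
    float *dSrc, *dTmp, *dDst;
    size_t pSrc, pTmp, pDst;  // pitches in bytes
    CHECK(cudaMallocPitch((void**)&dSrc, &pSrc, rowBytes, h));
    CHECK(cudaMallocPitch((void**)&dTmp, &pTmp, rowBytes, h));
    CHECK(cudaMallocPitch((void**)&dDst, &pDst, rowBytes, h));
    CHECK(cudaMemcpy2D(dSrc, pSrc, in.data(), rowBytes, rowBytes, h, cudaMemcpyHostToDevice));

    cudaTextureObject_t texSrc = wrapAsTexture(dSrc, pSrc, w, h);
    cudaTextureObject_t texTmp = wrapAsTexture(dTmp, pTmp, w, h);  // intermediate stays on device
    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
    convolve1D<<<grid, block>>>(texSrc, dTmp, pTmp / sizeof(float), w, h, 1, 0);  // row pass
    convolve1D<<<grid, block>>>(texTmp, dDst, pDst / sizeof(float), w, h, 0, 1);  // column pass
    CHECK(cudaGetLastError());

    std::vector<float> out(in.size());
    CHECK(cudaMemcpy2D(out.data(), rowBytes, dDst, pDst, rowBytes, h, cudaMemcpyDeviceToHost));
    CHECK(cudaDestroyTextureObject(texSrc));
    CHECK(cudaDestroyTextureObject(texTmp));
    CHECK(cudaFree(dSrc)); CHECK(cudaFree(dTmp)); CHECK(cudaFree(dDst));
    return out;
}

int main() {
    const int w = 16, h = 16;
    std::vector<float> img(w * h, 0.0f);
    img[8 * w + 8] = 1.0f;  // single impulse, so the output shows the 2D kernel
    std::vector<float> out = separableConvolve(img, w, h);
    printf("centre %f, diagonal neighbour %f\n", out[8 * w + 8], out[9 * w + 9]);
    return 0;
}
```

With the taps 0.25, 0.5, 0.25 and a single impulse as input, the centre pixel should come out as 0.5 * 0.5 = 0.25 and a diagonal neighbour as 0.25 * 0.25 = 0.0625, which is a quick check that the two 1D passes compose into the 2D kernel. The only host transfers are the initial upload and the final download. The row result is written once to device memory and read straight back through a texture in the column pass. On older toolkits, binding a texture reference to pitched memory with `cudaBindTexture2D` played a similar role. Whether this beats IPP for a given image and kernel size would still have to be measured.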