Help: Shared memory vs. Caching in ConvolutionSeparable Example

Hello!

My question concerns the use of shared memory versus caching via texture lookups. In particular, I refer to the ‘convolutionSeparable’ example, in which a convolution is applied to an input array. The array is convolved by first processing the rows, then the columns (for caching speedups). This is done in two successive kernel functions. In each function, the corresponding data rows/columns are first read into a shared memory region, which is then used for doing the convolution. My question is: how does this speed things up? Isn’t that doubling the effort? If I read the texture values anyway, why not compute the convolution values for that row directly instead of first buffering the values in a shared segment, synching the threads and then doing all the calculation?
Since I am fairly new to CUDA, I apologize in advance if this is some kind of ‘noobish’ question, but I hope some of you can clear my mind about it :-)
Thanks a lot!

Each value is needed by many threads (every thread within KERNEL_RADIUS of it), but it only has to be fetched from global memory once. So instead of each thread doing 2*KERNEL_RADIUS reads from global memory, you do one global read per value and then 2*KERNEL_RADIUS reads from shared memory, which is much less costly.
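
To make this concrete, here is a minimal sketch of the shared-memory row pass (hypothetical names and simplified clamped border handling, not the exact SDK kernel; it assumes blockDim.x == TILE_W and a grid with one block row per image row):

// Sketch of a shared-memory row convolution (not the SDK's exact code).
// Each block loads a tile of its row plus a KERNEL_RADIUS halo on each
// side into shared memory, syncs, and then every thread computes its
// output entirely from the fast shared copy.

#define KERNEL_RADIUS 8
#define TILE_W        128

__constant__ float d_Kernel[2 * KERNEL_RADIUS + 1];

__global__ void convolutionRowSketch(float *d_Dst, const float *d_Src,
                                     int imageW)
{
    __shared__ float tile[TILE_W + 2 * KERNEL_RADIUS];

    const int rowStart = blockIdx.y * imageW;        // this block's row
    const int x        = blockIdx.x * TILE_W + threadIdx.x;

    // One global read per thread for the main tile (clamped at borders).
    tile[KERNEL_RADIUS + threadIdx.x] =
        d_Src[rowStart + min(max(x, 0), imageW - 1)];

    // The first KERNEL_RADIUS threads also fetch the left/right halos.
    if (threadIdx.x < KERNEL_RADIUS) {
        int left  = x - KERNEL_RADIUS;
        int right = x + TILE_W;
        tile[threadIdx.x] =
            d_Src[rowStart + min(max(left, 0), imageW - 1)];
        tile[KERNEL_RADIUS + TILE_W + threadIdx.x] =
            d_Src[rowStart + min(max(right, 0), imageW - 1)];
    }

    __syncthreads();  // tile complete; all further reads hit shared memory

    // 2*KERNEL_RADIUS + 1 reads per thread, all from shared memory.
    float sum = 0.0f;
    for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++)
        sum += d_Kernel[KERNEL_RADIUS + k]
             * tile[KERNEL_RADIUS + threadIdx.x + k];

    if (x < imageW)
        d_Dst[rowStart + x] = sum;
}

Without the shared tile, the inner loop would issue 2*KERNEL_RADIUS + 1 texture/global reads per thread, and neighbouring threads would redundantly re-fetch the same overlapping values; with it, each input value is fetched from global memory only once per block.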