Two questions about the SDK sample projects

While reading the "convolutionSeparable" sample and its accompanying doc, convolutionSeparable.pdf, I ran into two questions.
The first is on page 9: the image is divided into several blocks, and each block is extended with an apron. How are the left and right edges of the image handled? Taking Figure 4 as an example, where does the apron data for the leftmost and rightmost blocks come from?
The second question is on page 10, where the second paragraph says:
As an example, consider a 16x16 image block and a kernel of radius 16. This only allows one active block per multiprocessor. Assuming 4 bytes per pixel, a block will use 9216 bytes. This is more than half of the available 16KB shared memory per multiprocessor on the G80 GPU. In this case, only 1/9 of the threads will be active after the load stage shown in Figure 5.
My question: if an image block is 16×16 at 4 bytes per pixel, shouldn't it use 16×16×4 = 1024 bytes? Where does the 9216 bytes come from?

Well, due to the nature of convolution, producing a 16x16 output block with a kernel of radius 16 requires an apron of 16 pixels on every side, so the input tile is (16+16+16) by (16+16+16) = 48x48 pixels. At 4 bytes per pixel that is 48x48x4 = 9216 bytes, which is the number in the doc.
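
On the first question: at the image borders the apron has no real neighbours to read, so the load stage has to substitute something, typically either zero or the nearest valid pixel (clamp-to-edge); check the sample's source for which one it actually uses. Below is a minimal CUDA sketch of the load stage, assuming clamp-to-edge, that also makes the 9216-byte figure concrete. Note the SDK sample organizes the load differently (it launches one thread per tile pixel, which is why only 1/9 of the threads stay active afterwards); here a 16x16 block loops so that each thread loads a 3x3 footprint, but the tile and its byte count are identical.

#include <cuda_runtime.h>

#define KERNEL_RADIUS 16
#define BLOCK_DIM     16
// Tile covered by one block, including the apron on all four sides:
// (16 + 16 + 16) x (16 + 16 + 16) = 48 x 48 pixels.
#define TILE_DIM (BLOCK_DIM + 2 * KERNEL_RADIUS)

// Load stage only (the part Figure 5 describes): each 16x16 block
// fills a 48x48 shared-memory tile, so each thread loads 3x3 pixels.
__global__ void loadTileWithApron(const float *d_src, float *d_dst,
                                  int width, int height)
{
    // 48 x 48 floats = 2304 * 4 bytes = 9216 bytes of shared memory,
    // the figure quoted in convolutionSeparable.pdf.
    __shared__ float tile[TILE_DIM][TILE_DIM];

    const int blockX = blockIdx.x * BLOCK_DIM; // top-left output pixel
    const int blockY = blockIdx.y * BLOCK_DIM;

    // Each thread loads TILE_DIM / BLOCK_DIM = 3 pixels per dimension.
    for (int ty = threadIdx.y; ty < TILE_DIM; ty += BLOCK_DIM) {
        for (int tx = threadIdx.x; tx < TILE_DIM; tx += BLOCK_DIM) {
            // Global coordinates of this tile element, shifted left/up
            // by the apron radius; may fall outside the image.
            int gx = blockX + tx - KERNEL_RADIUS;
            int gy = blockY + ty - KERNEL_RADIUS;
            // Clamp-to-edge: out-of-range apron pixels replicate the
            // nearest border pixel (zero-filling is the other common choice).
            gx = min(max(gx, 0), width  - 1);
            gy = min(max(gy, 0), height - 1);
            tile[ty][tx] = d_src[gy * width + gx];
        }
    }
    __syncthreads();

    // ... the actual convolution over tile[][] would go here; as a
    // stand-in, copy the centre pixel out so the kernel is self-contained.
    int ox = blockX + threadIdx.x;
    int oy = blockY + threadIdx.y;
    if (ox < width && oy < height)
        d_dst[oy * width + ox] = tile[threadIdx.y + KERNEL_RADIUS]
                                     [threadIdx.x + KERNEL_RADIUS];
}

Launch it with dim3 block(16, 16) and dim3 grid((width + 15) / 16, (height + 15) / 16); every output pixel then has a fully populated 48x48 tile behind it, including at the image edges.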