Separable Convolution in the SDK

I have a question regarding the separable convolution example in the SDK.

In the example they just have random values in the data array that they perform the convolution on. They put these values in the array after aligning the width to a multiple of 16 (for memory coalescing) and don’t really need to worry about what happens to the array contents when the width is aligned.

If you want to use this method to, for example, blur a picture whose width is not a multiple of 16, then when you copy the picture from host memory to device memory you need to pad each “row” of the array up to the aligned width, right? Or did I interpret the example wrong?

Say for example that you have a 2D image represented as a 1D array and you want to use this method to blur the image. To make this really easy, say the image is 12 pixels wide and 2 pixels high. The width will be aligned to 16, so you need to pad the picture array with 4 elements per row before copying the picture to the device, right? Like this:

Original picture:
Array elements: | 0 1 2 3 4 5 6 7 8 9 10 11 | 12 13 14 15 16 17 18 19 20 21 22 23 |

Picture on device memory:
Array element indices: | 0 1 2 3 4 5 6 7 8 9 10 11 [12 13 14 15 = padding] | 16 17 18 19 20 21 22 23 24 25 26 27 [28 29 30 31 = padding] |

Do I need to manually loop through my original picture, copy every row into a new array with a row pitch of 16, run the convolution kernel on that, and when it’s all finished do the same thing in reverse?
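Something like this is what I have in mind for the host-side padding step (plain C sketch of my own idea, not code from the SDK; `align_up` and `pad_rows` are just names I made up):

```c
#include <stdlib.h>
#include <string.h>

/* Round x up to the next multiple of a, e.g. align_up(12, 16) == 16. */
static int align_up(int x, int a) { return ((x + a - 1) / a) * a; }

/* Copy each row of src (width srcW) into a freshly allocated buffer
 * whose rows are paddedW elements wide; the extra elements are
 * zero-filled by calloc. Caller frees the result. */
static float *pad_rows(const float *src, int srcW, int height, int paddedW)
{
    float *dst = (float *)calloc((size_t)paddedW * height, sizeof(float));
    for (int y = 0; y < height; ++y)
        memcpy(dst + (size_t)y * paddedW, src + (size_t)y * srcW,
               srcW * sizeof(float));
    return dst;
}
```

And after the kernel finishes, a matching loop would copy `srcW` elements back out of each `paddedW`-wide row.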

I’m thinking about using this implementation for Gaussian blur of 3D volumes, typically 512x512x512 but possibly up to 1000x1000x1000, so this means a lot of looping and unnecessary copying. I hope someone will just tell me that I read the code all wrong, got the example backwards, and that there is a simpler way to do it :)