I am trying to read a section of an image into shared memory. Each thread of my kernel works on one pixel of the image. My filter radius is 3, so I want to load into shared memory the 3 rows before and the 3 rows after the row currently being processed, plus 3 extra columns on each side of the width of my thread block.
To keep everything in bounds, I check whether a thread is near the edges of the original image; if it is, it simply returns without doing anything.
How can I load this 7 x cols band into shared memory when I have to call __syncthreads(), yet some of my threads return early because of that boundary condition? What is the proper way to handle this? Every thread has to reach the __syncthreads() call, right?
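To make the question concrete, here is a minimal sketch of the pattern I suspect is needed. It is an illustration under assumptions, not my real kernel. Instead of returning early, each thread computes an `active` flag, every thread helps load the band and reaches `__syncthreads()`, and only the active threads write a result. The names `boxFilter`, `runBoxFilter`, `RADIUS` and `BLOCK_W` are hypothetical. So are the 7x7 box filter, the zero padding outside the image, and the one-row-per-block launch layout. Error checking on the CUDA calls is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define RADIUS  3
#define BLOCK_W 32                       // hypothetical block width; the launch must use it
#define TILE_W  (BLOCK_W + 2 * RADIUS)   // block width plus 3 columns on each side
#define TILE_H  (2 * RADIUS + 1)         // 7 rows: 3 above, the current row, 3 below

// Hypothetical 7x7 box filter; each block handles BLOCK_W pixels of one image row.
__global__ void boxFilter(const float* in, float* out, int width, int height)
{
    __shared__ float tile[TILE_H][TILE_W];

    const int col = blockIdx.x * BLOCK_W + threadIdx.x;
    const int row = blockIdx.y;

    // Boundary test: threads near the image edges produce no output,
    // but they do NOT return here, so they can still reach the barrier.
    const bool active = col >= RADIUS && col < width  - RADIUS &&
                        row >= RADIUS && row < height - RADIUS;

    // Cooperative load: all threads of the block take part, active or not.
    const int tileCol0 = blockIdx.x * BLOCK_W - RADIUS;
    const int tileRow0 = row - RADIUS;
    for (int i = threadIdx.x; i < TILE_H * TILE_W; i += BLOCK_W) {
        const int ty = i / TILE_W, tx = i % TILE_W;
        const int gy = tileRow0 + ty, gx = tileCol0 + tx;
        const bool inside = gx >= 0 && gx < width && gy >= 0 && gy < height;
        tile[ty][tx] = inside ? in[gy * width + gx] : 0.0f;  // assumed zero padding
    }

    __syncthreads();  // outside any divergent branch: every thread reaches it

    if (active) {
        float sum = 0.0f;
        for (int dy = 0; dy < TILE_H; ++dy)
            for (int dx = 0; dx <= 2 * RADIUS; ++dx)
                sum += tile[dy][threadIdx.x + dx];
        out[row * width + col] = sum / float(TILE_H * TILE_H);
    }
}

// Hypothetical host helper: copies the image to the GPU, runs the kernel,
// and returns the result. Border pixels are left at 0.
std::vector<float> runBoxFilter(const std::vector<float>& in, int width, int height)
{
    const size_t bytes = in.size() * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, bytes);

    const dim3 block(BLOCK_W);
    const dim3 grid((width + BLOCK_W - 1) / BLOCK_W, height);
    boxFilter<<<grid, block>>>(dIn, dOut, width, height);

    std::vector<float> out(in.size());
    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

As far as I understand, `__syncthreads()` is only safe inside a conditional when the condition evaluates the same way for every thread in the block. A per-thread boundary test does not meet that requirement, which is why the sketch moves the test after the barrier instead of returning before it.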