Image Convolution with CUDA paper: not quite understanding the tiling method they're showing

Going through this paper:…onSeparable.pdf

I’m assuming that when they refer to a ‘kernel of radius 16’ they mean a 33x33 kernel (16 + 1 + 16 pixels in each dimension).

I’m a little confused as to how the tiling method they present works, using a 16x16 thread block to perform the convolution. How are all the threads active during the computation stage? If only a 16x16 image block is first loaded into shared memory, why aren’t threads idled while the very large apron of this image block is loaded from slower global memory?

I am in the same boat.

“For example, if we use a vertical column of threads with the same width as the image block we are processing, we have a 48x48 pixel region in shared memory, but only 16x48 threads.”

I take this quote as meaning there are three thread blocks of size 16x48.

I can see why the middle 16x48 block would be the only one used in the computation stage (because the other two only hold the values for the apron).

If there are 9 thread blocks of 16x16 each, won’t we be reduced back to the original problem of only using 1/9th of the total threads (because only 1 of the 9 blocks has information we care about)?

The thing is that there is always a 48x48 shared-memory block, but the thread block has either 16x48 or 16x16 threads.
In the case of a 16x48 thread block, each thread has to load 3 pixels: the thread-block window moves horizontally across the 48x48 image region. First it loads the left part, then jumps 16 pixels to the right and loads again, then jumps and loads once more. This way all 48x48 pixels end up in shared memory. The same thread block is then used for the computation; however, only its middle part is used, since the upper and lower threads fall into the apron region (you can only convolve those pixels that are surrounded by 16 pixels in each direction).
In the case of a 16x16 thread block, each thread reads 9 pixels: the thread-block window moves both horizontally and vertically across the 48x48 image region. All 16x16 threads are used to compute the convolution in this case, since there are 16x16 output pixels surrounded by a 16-pixel-thick apron.

Hope it helped a bit. I’m currently working with this convolution implementation and I’m not sure of everything, too.

So one block of the actual grid in the example corresponds to the apron on each side, with the data fenced in the middle?
And the block size needs to be a multiple of the warp size.

So, given that I have a 3x3 filter,
my block will be (3+3+3)x(3+3+3) threads. Does “block size” refer to one dimension (9) or to the whole size (81)?
So should my dimension be a multiple of 32, or the whole block size?

I didn’t quite understand why I need additional threads (that do nothing) to pad the block’s dimensions so that they align with the warp size.

I also have a general question about the size of ROW_TILE_W in this paper: how large should it be?


Be aware that I’m not an expert, but I’m pretty sure about what I’ve written.

Yes, the inner block of data corresponds to the points that are output after convolution.

You don’t have to make the apron width equal to the data width. In the SDK example it just happens that the kernel (filter) radius is equal to the data block width/height. I think that if you have a 3x3 filter you can still use a 16x16 block of data to be output, with a 1-pixel apron on each side. In that case, padding the apron out to 16 pixels would be like taking a sledgehammer to crack a nut, so a shared-memory data block of size (1+16+1)x(1+16+1) would work well.

Block size refers to the total number of threads, not to one dimension only. A block of dimensions 16x16 has a size of 256.

It means that you want coalesced loads for your data block (the one that is processed). This implies starting the read 16 pixels before the data block, even if the apron is smaller than that. That comes from the fact that your thread block reads pixels into shared memory sequentially, from left to right. If you started reading at the beginning of an apron narrower than 16 pixels, the whole sequence of loads would be misaligned and therefore non-coalesced.

ROW_TILE_W is the width of the data that will be processed by one thread block. When you use a 3x3 filter and a thread block of size 16x16 (the shared-memory block is then (1+16+1)x(1+16+1)), ROW_TILE_W = 16.

Post further questions, should you have any doubts. I’ll try to answer a bit faster.

Could you please explain (first paragraph, page 9 of the same document):

[b][i]Each pixel within the apron-width area on the outside of the image will be loaded 9 times because of the overlap between neighboring blocks![/i][/b]

How exactly 9 times?

16x48 = 768 threads per block, which is more than the 512-threads-per-block limit… It is only possible if you are talking about a compute capability 2.0 device…

Or am I misunderstanding your statement?

If you read a 16x16 block (the grey one from the document), you have to load the apron too, so in total you have nine 16x16 data blocks (8 yellow and one grey). Each pixel is loaded 9 times because the 16x16 blocks in the apron must also act as the center block (the grey one) when another part of the image is processed. For example, you process one 16x16 block with its apron; then you process the next block, which was previously part of the apron, and now the previous block is part of the new apron. So every block (every pixel) is loaded nine times: once as the center block being processed and eight times as part of an apron. (This does not apply to border pixels, which are loaded fewer times: blocks in the corners are loaded four times, and blocks on the edges six times.)

What would happen if the kernel size were larger than that of the tile?

I think the kernel size should be chosen to be comparable to the tile size. If the kernel is larger than the tile, each thread has to load a much larger apron, and convolving the image becomes considerably more complex.