Optimal way to handle your grid size not matching your problem set

I’m writing a simple box blur program that takes an image as input. I am using block sizes that are a multiple of 32, where blockIdx.(x|y) * blockDim.(x|y) + threadIdx.(x|y) gives the x and y coordinates of each pixel in the image. blockIdx.z can be 0, 1, or 2, and those indices correspond to the R, G, and B channels of each pixel.
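
Roughly, the mapping I’m describing looks like this (a simplified sketch, not my actual kernel; the planar RGB layout and the names are just for illustration):

```cpp
__global__ void boxBlurKernel(const unsigned char *in, unsigned char *out,
                              int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // pixel column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // pixel row
    int c = blockIdx.z;                             // 0 = R, 1 = G, 2 = B

    // Planar layout assumed: channel c starts at c * width * height.
    int idx = (c * height + y) * width + x;
    out[idx] = in[idx];  // placeholder; the box blur of (x, y) would go here
}
```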

Since the box blur algorithm references nearby pixels, I need to handle edge pixels, where the nearby pixels can fall outside the array. Originally, I did this by clamping to the image’s minimum and maximum width and height. While removing conditionals and for loops to improve efficiency, I ended up padding the image array prior to passing it to the kernel. (I realize that to measure total efficiency I need to account for the time it takes to add this padding, which might make the overall process slower even though the kernel itself runs faster, but let’s ignore that for the sake of practice and theory.) Now I can perform my algorithm on the image without any conditionals, since previously out-of-bounds indices now fall within the padding.
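
For context, the clamp-based neighbor lookup I started with was along these lines (again just a sketch; fetchClamped is a made-up helper name, not my real code):

```cpp
// Clamp a neighbor coordinate into [0, width-1] x [0, height-1] before reading.
__device__ unsigned char fetchClamped(const unsigned char *in,
                                      int x, int y, int c,
                                      int width, int height)
{
    int cx = min(max(x, 0), width  - 1);
    int cy = min(max(y, 0), height - 1);
    return in[(c * height + cy) * width + cx];
}
```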

However, I now have a problem when my grid’s width and height don’t exactly match my input image’s width and height.

Is there a preferred, performant way to handle this scenario? At the very least, my current implementation only requires checking my index against the maximum width and height and doesn’t require checking for negative indices, but I’m curious whether there are any clever ways around this, or a totally different way to think about the problem.
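
In other words, with the padded buffer the only guard I still need is the upper-bound check, something like this (illustrative only; pad is the blur radius):

```cpp
// Kernel over the padded image: only the upper bound needs checking,
// since negative neighbor offsets land in the padding.
__global__ void boxBlurPadded(const unsigned char *padded, unsigned char *out,
                              int width, int height, int pad)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;   // grid-overshoot guard only

    int c = blockIdx.z;
    int paddedWidth = width + 2 * pad;
    // Neighbor reads at (x + pad + dx, y + pad + dy) stay in bounds for
    // |dx|, |dy| <= pad, with no negative-index check needed.
    out[(c * height + y) * width + x] =
        padded[(c * (height + 2 * pad) + (y + pad)) * paddedWidth + (x + pad)];
}
```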

I think the usual suggestion here is a grid-stride loop. It can be formulated with a 1D kernel design or a 2D kernel design.
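
For example, a 2D grid-stride formulation might look roughly like this (an untested sketch; the planar layout and kernel body are assumptions carried over from the question):

```cpp
// Each thread strides across the image in steps of the total grid size,
// so any grid dimensions cover the whole image without matching it exactly.
__global__ void boxBlurGridStride(const unsigned char *in, unsigned char *out,
                                  int width, int height)
{
    int c = blockIdx.z;  // channel, as in the original indexing scheme
    for (int y = blockIdx.y * blockDim.y + threadIdx.y; y < height;
         y += gridDim.y * blockDim.y)
    {
        for (int x = blockIdx.x * blockDim.x + threadIdx.x; x < width;
             x += gridDim.x * blockDim.x)
        {
            int idx = (c * height + y) * width + x;
            out[idx] = in[idx];  // placeholder for the box blur of (x, y)
        }
    }
}
```

The launch configuration can then round up, e.g. dim3 grid((width + 31) / 32, (height + 31) / 32, 3), or even use a smaller fixed grid; the loops pick up any pixels the grid doesn’t cover on the first pass.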

