I’ve spent a bit of time trying to implement flood fill in GPU. The main reason is that we have a pipeline of imaging algorithms we need to run, and we do not want to download an image to CPU, run CPU flood fill, and then upload back to GPU for further processing.
The main difficulty is that this algorithm does not seem to parallelize easily. You start at a seed point, and do not really know how the flood fill will grow, so it is not straightforward to divide the image up and have thread groups process different sections of it.
As a first attempt, I wrote a kernel that uses 1 thread block with 512 threads, and it processes a column at a time where each thread processes an element in the column and does a vertical scanline type flood fill (yes for current prototype, image height is capped at 512). The kernel loops to march left/right until it reaches the image bounds. For what I will call a medium sized fill, on my GTX 1070, this took about 0.5 milliseconds.
To be honest, the limited parallelism I’m getting is not much, and there is a lot of redundancy in the vertical scan fills, as most threads will write the same value to the output. I use local memory to at least avoid redundant writes to global memory.
Does anyone have any good ideas on how to implement this better on GPU? Thinking out loud, but maybe I can use two thread blocks. Using the same algorithm as above, one will handle processing the columns to the right of the seed point, and the other processing columns to the left of the seed point.