FloodFill

I’ve spent a bit of time trying to implement flood fill on the GPU. The main reason is that we have a pipeline of imaging algorithms we need to run, and we do not want to download an image to the CPU, run a CPU flood fill, and then upload it back to the GPU for further processing.

The main difficulty is that this algorithm does not seem to parallelize easily. You start at a seed point, and do not really know how the flood fill will grow, so it is not straightforward to divide the image up and have thread groups process different sections of it.

As a first attempt, I wrote a kernel that uses one thread block of 512 threads and processes one column at a time: each thread handles one element of the column and does a vertical scanline-type flood fill (yes, for the current prototype, image height is capped at 512). The kernel loops to march left/right until it reaches the image bounds. For what I will call a medium-sized fill, this took about 0.5 milliseconds on my GTX 1070.
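In outline, the kernel looks something like this (a simplified sketch rather than my actual code; the row-major 8-bit mask layout and the non-zero fillability test are placeholder assumptions, and only the rightward march is shown):

```
// One thread per row, one block; the block marches column by column to the
// right of the seed. The leftward march is symmetric and omitted.
__global__ void columnMarchFillRight(const unsigned char* img,
                                     unsigned char* out,
                                     int width, int height,
                                     int seedX, int seedY)
{
    __shared__ bool prev[512];    // fill state of the previous column
    __shared__ bool curr[512];    // fill state of the column being built
    __shared__ int  changed;      // convergence flag for the vertical fill

    const int  y      = threadIdx.x;
    const bool active = (y < height);

    if (active) prev[y] = (y == seedY);   // only the seed row starts out filled
    __syncthreads();

    for (int x = seedX; x < width; ++x)
    {
        const bool fillable = active && img[y * width + x] != 0;

        // Seed this column from the filled pixels of the previous column.
        if (active) curr[y] = fillable && prev[y];
        __syncthreads();

        // Vertical scanline fill within the column: propagate up/down
        // until nothing changes any more.
        int keepGoing;
        do
        {
            if (y == 0) changed = 0;
            __syncthreads();
            if (fillable && !curr[y])
            {
                const bool up   = (y > 0)          && curr[y - 1];
                const bool down = (y + 1 < height) && curr[y + 1];
                if (up || down) { curr[y] = true; changed = 1; }
            }
            __syncthreads();
            keepGoing = changed;          // copy before thread 0 resets it
            __syncthreads();
        } while (keepGoing);

        if (active && curr[y]) out[y * width + x] = 255;
        if (active) prev[y] = curr[y];
        __syncthreads();
    }
}
```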

To be honest, the parallelism I’m getting is limited, and there is a lot of redundancy in the vertical scan fills, as most threads end up writing the same value to the output. I use local memory to at least avoid redundant writes to global memory.

Does anyone have any good ideas on how to implement this better on the GPU? Thinking out loud, but maybe I could use two thread blocks: with the same algorithm as above, one would process the columns to the right of the seed point and the other the columns to the left.

Do a scan in the column direction first, then handle each line with a separate warp? That should expose a lot of parallelism pretty quickly. Key to good performance would be extensive use of __ballot() and __clz().
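A sketch of the warp-level building block I have in mind (the row-segment layout and the non-zero fillability test are assumptions, the function must be called by all 32 lanes of a warp, and with current CUDA you would use the _sync variants of the intrinsics):

```
// Each lane inspects one pixel of a 32-pixel row segment; __ballot_sync()
// collects which pixels are fillable, and __clz()/__ffs() extract the
// contiguous fillable run containing 'seedLane', which is then marked.
__device__ void fillRunAroundSeed(const unsigned char* rowSegment,
                                  unsigned char* outSegment,
                                  int seedLane)          // 0..31
{
    const int  lane     = threadIdx.x & 31;
    const bool fillable = rowSegment[lane] != 0;         // placeholder test

    const unsigned mask = __ballot_sync(0xffffffffu, fillable);

    if (((mask >> seedLane) & 1u) == 0)                  // seed pixel itself is
        return;                                          // not fillable: done

    // Highest non-fillable pixel strictly below the seed bit marks where
    // the run starts (one position above it).
    const unsigned below    = ~mask & ((1u << seedLane) - 1u);
    const int      runStart = below ? (32 - __clz((int)below)) : 0;

    // Lowest non-fillable pixel at or above the seed bit marks one past
    // where the run ends.
    const unsigned above  = ~mask & ~((1u << seedLane) - 1u);
    const int      runEnd = above ? (__ffs((int)above) - 1) : 32;   // exclusive

    if (lane >= runStart && lane < runEnd)
        outSegment[lane] = 255;                          // mark the run
}
```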

Maintain a stack in shared memory, via atomic functions, of points still to scan from, and repeat the alternating-direction scan from there. It seems to me flood fill should parallelize reasonably well.
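As a rough sketch (sizes and names are arbitrary, and I’ve used two alternating stacks so that pushes for the next pass never race with pops from the current one; the actual scan from each popped point is omitted):

```
struct Seed { short x, y; };

__global__ void blockStackFill(/* image arguments omitted */ int seedX, int seedY)
{
    // Two alternating work stacks in shared memory: threads consume the
    // current pass from one and push newly found seeds onto the other.
    __shared__ Seed s_buf[2][2048];     // capacity is an arbitrary assumption
    __shared__ int  s_count[2];

    int cur = 0;
    if (threadIdx.x == 0)
    {
        s_count[0] = 1;
        s_count[1] = 0;
        s_buf[0][0] = Seed{ (short)seedX, (short)seedY };
    }
    __syncthreads();

    while (s_count[cur] > 0)
    {
        const int next = cur ^ 1;

        // Each thread takes a share of the pending seeds of this pass.
        for (int i = threadIdx.x; i < s_count[cur]; i += blockDim.x)
        {
            Seed s = s_buf[cur][i];
            // ... run the directional scan from 's' here; every newly found
            // seed is appended to the next pass with an atomic, e.g.:
            //     int idx = atomicAdd(&s_count[next], 1);
            //     if (idx < 2048) s_buf[next][idx] = newSeed;   // else spill
            (void)s;    // scan omitted in this sketch
        }
        __syncthreads();

        if (threadIdx.x == 0) s_count[cur] = 0;   // this pass is fully consumed
        cur = next;
        __syncthreads();
    }
}
```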

I haven’t used a flood-fill in thirty years, so I am also just thinking out loud here. Are these binary or gray-scale images? What is a typical size? How time-consuming is this step relative to the rest of the processing pipeline? A poorly parallelized GPU implementation may be OK if it avoids moving the data to the CPU and back.

As I recall, single-threaded iterative versions use a queue. In each step, pull a pixel off the front of the queue, color it, check its four neighbors (assuming 4-connectivity), and add any colorable pixels discovered to the back of the queue. Repeat until the queue is empty.
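From memory, something like this on the CPU (a sketch with an assumed byte-mask image; it marks pixels when they are enqueued rather than when they are dequeued, so the same pixel is never added twice):

```
#include <queue>
#include <utility>
#include <vector>

void floodFillCpu(std::vector<unsigned char>& img, int width, int height,
                  int seedX, int seedY, unsigned char fillValue)
{
    const unsigned char target = img[seedY * width + seedX];
    if (target == fillValue) return;               // nothing to do

    std::queue<std::pair<int, int>> q;
    q.push({seedX, seedY});
    img[seedY * width + seedX] = fillValue;        // mark when enqueued

    const int dx[4] = { 1, -1, 0, 0 };
    const int dy[4] = { 0, 0, 1, -1 };             // 4-connectivity

    while (!q.empty())
    {
        auto [x, y] = q.front();
        q.pop();
        for (int i = 0; i < 4; ++i)
        {
            int nx = x + dx[i], ny = y + dy[i];
            if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
            if (img[ny * width + nx] != target) continue;   // not colorable
            img[ny * width + nx] = fillValue;
            q.push({nx, ny});
        }
    }
}
```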

This could be parallelized by assigning each thread to a separate start pixel and having it operate its own queue. The search area for each thread can be limited by tracking the distance of examined pixels from the start pixel. Storage for the queues could be an issue (e.g. if shared memory is used for them), and would likely limit the parallelism.

A few questions on this method:

  1. Would you use multiple thread blocks (and then multiple dispatches to sync to a global work queue between passes)? How do you synchronize writes to global memory from different thread groups? Do atomics sync across thread blocks?

Or would you use one thread block and __syncthreads()?

  2. If using multiple thread blocks and multiple kernel passes, how do you know when to stop dispatching kernels? Obviously, when the work queue is empty, but won’t that require a CPU readback of the queue count?
  1. I would start my implementation with a single thread block and see how that fares. I'd suspect that the problem is not large enough to warrant more effort.

    However, Njuffa made an important point: for correctness you’ll need a spill mechanism from your queue or stack in shared memory into global memory, to handle pathological cases. This should rarely be needed in practice if you optimise your algorithm a bit beyond just pushing all neighbors into the queue unconditionally. But if you try to flood-fill complicated patterns or labyrinths, shared memory can indeed be too small. And once you have the mechanism in place to spill into global memory, it might not be much more work to have other blocks take work from there.

    Atomics in global memory are atomic with respect to any thread on the device. No atomic operation synchronizes, however. You’ll have to combine memory fence instructions with atomics to synchronize access between threads (a rough sketch of such a global-memory queue follows after this list).

  2. Invoking multiple kernels might be an easy way to synchronize access to the queue in global memory, but it would essentially defeat the purpose of using multiple blocks (if you launch more than one block per kernel, you would need to synchronise between them anyway). I would stay with a single kernel launch.
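To make the fence-plus-atomic point concrete, here is a rough sketch of the publish side of a work queue in global memory (my illustration only; names, capacity and layout are assumptions, and the consumer side needs its own cursor and ready check):

```
struct WorkItem { int x, y; };

#define QUEUE_CAP (1 << 16)                       // assumed capacity

__device__ WorkItem g_items[QUEUE_CAP];
__device__ int      g_ready[QUEUE_CAP];           // 0 = empty, 1 = payload visible
__device__ unsigned g_tail;                       // next free slot for producers
// g_tail and g_ready[] are assumed to be zeroed from the host before the fill.

__device__ void publishWork(WorkItem item)
{
    unsigned slot = atomicAdd(&g_tail, 1u);       // claim a slot
    if (slot >= QUEUE_CAP) return;                // out of space; a real version
                                                  // must handle this, not drop work
    g_items[slot] = item;                         // write the payload
    __threadfence();                              // make it visible device-wide...
    atomicExch(&g_ready[slot], 1);                // ...before announcing it
}

// A consumer in another block would atomically claim a slot index of its own,
// check g_ready[slot] (e.g. with atomicCAS), and only then read g_items[slot].
```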

Looking at this more, I think I will need to use global memory for the queue/stack. Even a simple example like flood-filling a rectangle that is, say, 200x200 will add a lot of nodes to process.

My plan is to write the nodes to global memory, then consolidate the nodes into scan lines for processing in the next iteration.

The question is: if using one thread block, is __syncthreads() enough to synchronize global memory writes? That is:

  1. ProcessRows(); // This writes to global memory.
  2. __syncthreads();
  3. BuildVerticalScanLinesForNextPass(); // <-- this will read from the updated global memory, is it safe?
  4. ProcessColumns();

Yes, __syncthreads() is effective for global memory as well, not just shared memory.
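For example, within one block this is safe (a tiny self-contained illustration of just that point, not the flood fill itself; d_buf needs at least blockDim.x elements and d_ok is assumed to be initialised to 1 by the host):

```
// Every thread writes to global memory, the barrier orders those writes for
// the whole block, and afterwards reading a value written by a *different*
// thread of the same block is safe.
__global__ void syncthreadsGlobalVisibility(int* d_buf, int* d_ok)
{
    const int tid  = threadIdx.x;
    const int next = (tid + 1) % (int)blockDim.x;

    d_buf[tid] = tid;          // step 1: every thread writes to global memory
    __syncthreads();           // step 2: the barrier also covers global memory
                               //         accesses made by this block

    if (d_buf[next] != next)   // step 3: reading another thread's write is safe
        atomicExch(d_ok, 0);   // would flag a visibility problem (it won't fire)
}
```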

Just like Njuffa, it’s been three decades since I last wrote a flood fill. However, the simple algorithm I came up with only needed a stack size of 2 to fill a rectangle from an arbitrary interior point: one entry for the scanline above the current one and one for the scanline below.

It wasn’t the most advanced algorithm, but in the end it even got used in two commercial products that had evaluated much more complicated algorithms, because it used only about a hundred bytes of machine code and memory was at a premium in those days.
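For reference, a generic reconstruction of that style of scanline fill looks something like this (CPU code from memory, not the original; it pushes at most one seed per unfilled run found directly above and below the span it just filled, which is why a rectangle filled from an interior point needs only a tiny stack):

```
#include <utility>
#include <vector>

void scanlineFill(std::vector<unsigned char>& img, int w, int h,
                  int seedX, int seedY, unsigned char fill)
{
    const unsigned char target = img[seedY * w + seedX];
    if (target == fill) return;

    std::vector<std::pair<int, int>> stack;        // seeds still to expand
    stack.push_back({seedX, seedY});

    while (!stack.empty())
    {
        auto [x, y] = stack.back();
        stack.pop_back();
        if (img[y * w + x] != target) continue;    // already filled meanwhile

        // Expand to the full horizontal run containing (x, y).
        int xl = x;
        while (xl > 0 && img[y * w + xl - 1] == target) --xl;
        int xr = x;
        while (xr + 1 < w && img[y * w + xr + 1] == target) ++xr;

        for (int i = xl; i <= xr; ++i)
            img[y * w + i] = fill;                 // fill the run

        // Push one seed per unfilled run directly above and below this run.
        for (int dy = -1; dy <= 1; dy += 2)
        {
            const int ny = y + dy;
            if (ny < 0 || ny >= h) continue;
            bool inRun = false;
            for (int i = xl; i <= xr; ++i)
            {
                const bool fillable = img[ny * w + i] == target;
                if (fillable && !inRun)
                    stack.push_back({i, ny});      // first pixel of a new run
                inRun = fillable;
            }
        }
    }
}
```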

Thanks for your help and the quick replies. I must be misunderstanding something. I get the column direction first: say the rectangle is 100x100 pixels. So now I’ve filled that column and I have 100 rows to process in the next iteration. Isn’t the stack count 100 now? A thread is assigned to each row, and each row will add 99 columns to process for the next pass. Of course we are done by then, but it seems the stack count grew a lot.