Image labelling with CUDA

Hi all,

I’m trying to implement connected components labelling on the GPU. So far I’ve implemented the core of the labelling algorithm, which runs considerably faster on the GPU than the CPU implementation. However, I’ve run out of ideas on how to solve the problem below:

After a 4-connected labelling procedure I end up with an array like this (0 = background pixels):

0 0 0 0 0 0 0 0
0 2 2 0 0 0 0 0
0 2 2 0 0 0 0 0
0 2 2 0 0 0 0 0
0 0 0 0 6 6 0 0
0 0 0 0 6 6 0 0
0 0 0 0 6 6 0 0
0 0 0 0 0 0 0 0

But I want to remap the labels [2, 6] to [1, 2], so the output image looks like this:

0 0 0 0 0 0 0 0
0 1 1 0 0 0 0 0
0 1 1 0 0 0 0 0
0 1 1 0 0 0 0 0
0 0 0 0 2 2 0 0
0 0 0 0 2 2 0 0
0 0 0 0 2 2 0 0
0 0 0 0 0 0 0 0

I’ve had a couple of ideas involving sorting the initial labels, but with little success. I would really appreciate any ideas or suggestions. Many thanks for reading.

cheers

I outlined a rough algorithm strategy for connected components here. I’d probably attack it the same way now.

After reading that again (it’s my post from 8 months ago), one algorithm tweak I might make now would be to deliberately pad the length of the shared memory array to an odd number to break up the bank conflicts, making both horizontal and vertical scanning bank-conflict free.
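Concretely, the padding is just one spare element per tile row. A rough sketch of the data layout only; the 16x16 tile size and the assumption that the image dimensions are multiples of it are mine, not from the original post:

#define TILE 16   // assumed tile size

__global__ void load_padded_tile(const unsigned int* image, unsigned int* out, int width)
{
	// Row stride padded from 16 to 17 (an odd number), so walking down a
	// column hits a different shared memory bank on every step.
	__shared__ unsigned int tile[TILE][TILE + 1];

	int x = blockIdx.x * TILE + threadIdx.x;
	int y = blockIdx.y * TILE + threadIdx.y;

	tile[threadIdx.y][threadIdx.x] = image[y * width + x];
	__syncthreads();

	// ... horizontal and vertical label scans over tile[][] would go here ...

	out[y * width + x] = tile[threadIdx.y][threadIdx.x];
}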

As for the relabelling, the strategy depends on how large your labels get. If you have fewer than a few thousand, you could build an existence table in shared memory (initialise it to all zeros; when you see an ID, set that index to 1), then do a compaction on it, giving you an efficient remapping.
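Very roughly, a sketch of that idea. I’m assuming a hard cap of 512 on the label values so the whole table fits in one block, and the compaction is a plain Hillis-Steele scan, nothing optimised:

#define MAX_LABELS 512   // assumed upper bound on label values; one thread per entry

// One block: existence table in shared memory, inclusive scan over it, then
// write out remap[old_label - 1] = new compacted label (1-based).
// exists[i] is 1 if label (i + 1) occurs somewhere in the image, 0 otherwise.
__global__ void build_remap_kernel(const unsigned int* exists, unsigned int* remap)
{
	__shared__ unsigned int table[MAX_LABELS];
	int tid = threadIdx.x;

	table[tid] = exists[tid];
	__syncthreads();

	// In-place Hillis-Steele inclusive scan: afterwards table[tid] holds the
	// number of existing labels with value <= tid + 1.
	for (int offset = 1; offset < MAX_LABELS; offset <<= 1)
	{
		unsigned int v = (tid >= offset) ? table[tid - offset] : 0;
		__syncthreads();
		table[tid] += v;
		__syncthreads();
	}

	// Labels that exist get their compacted (1-based) replacement; others get 0.
	remap[tid] = exists[tid] ? table[tid] : 0;
}

// Launch with a single block: build_remap_kernel<<<1, MAX_LABELS>>>(d_exists, d_remap);
// A second pass over the image then replaces every non-zero label L with remap[L - 1].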

If you have more than a few thousand labels, it’s harder since you can’t hold them all in a block at once, but it’s still possible. You could sort the labels, redundancies and all, then look for the transitions to clump the repeats together, then do a compaction on that final list.
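A sketch of that sort-based route using the Thrust library (my example, not from the posts; it takes one label entry per pixel, zeros included, and sorts a copy of the whole thing, which is wasteful but keeps it simple):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/unique.h>
#include <thrust/binary_search.h>

void remap_by_sorting(const thrust::device_vector<unsigned int>& labels,
                      thrust::device_vector<unsigned int>& new_labels)
{
	// Sort a copy of the labels, duplicates and background zeros included.
	thrust::device_vector<unsigned int> sorted = labels;
	thrust::sort(sorted.begin(), sorted.end());

	// "Look for the transitions": unique() collapses each run of repeats,
	// leaving one entry per distinct value (0, 2, 6 in the example above).
	sorted.erase(thrust::unique(sorted.begin(), sorted.end()), sorted.end());

	// Replace every label by its index in the distinct list. Background 0
	// sorts to the front, so it maps to 0 and the real labels come out
	// as 1, 2, ... in increasing label order.
	new_labels.resize(labels.size());
	thrust::lower_bound(sorted.begin(), sorted.end(),
	                    labels.begin(), labels.end(),
	                    new_labels.begin());
}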

A way to work around the shared memory limit is to let different blocks load different parts of the labels, and have all blocks read from the same memory locations. I’m not sure whether the GPU is smart enough to avoid reading from the same location multiple times, but if you have enough blocks resident at once (a GTX 280 has 30 multiprocessors, for example) you can handle a much larger number of labels.

Hi guys,

Many thanks for the replies with great ideas. SPWorley, thanks for the link to your implementation.

I’ve done the relabelling using atomic functions. I have an array (as big as the image) in which I flag only the entries corresponding to the labels that occur:

array[LABELn - 1] = 1

So, the array looks like this:

0 1 0 0 0 1 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
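Spelled out, that flagging step is just a one-element scatter per pixel. This is my reconstruction of it, assuming the labelled image and the flag array are both image-sized buffers in global memory:

__global__ void flag_labels_kernel(const unsigned int* labels, unsigned int* flags, const uint2 image_size)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	if (x >= (int)image_size.x || y >= (int)image_size.y)
		return;

	unsigned int label = labels[y * image_size.x + x];
	if (label > 0)
		flags[label - 1] = 1;   // many threads may write the same 1; that is harmless
}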

Then, I apply the following kernel to this array to produce incremental labels:

__global__ void label_increment_kernel(unsigned int* source, int* count, const uint2 image_size)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	if (x >= (int)image_size.x || y >= (int)image_size.y)
		return;

	unsigned int index = y * image_size.x + x;
	unsigned int src_val = source[index];

	if (src_val > 0)
	{
		// count is assumed to start at 0 on the host; atomicAdd returns the
		// old value, so the flags are overwritten with the labels 1, 2, 3, ...
		int val = atomicAdd(count, 1);
		source[index] = val + 1;
	}
}

I end up with something like this:

0 1 0 0 0 2 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0

I then use this array to remap the non-sequential labels, and it runs perfectly fine.

However, the problem is that there is no guarantee the labels generated with the atomic function will be in raster order, which I need for reproducible results. The output is different on every run, because the first label goes to whichever thread happens to be scheduled first.

I might have to go down the sorting path I guess. Suggestions…
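Thinking out loud, one alternative to sorting might be to replace the atomic counter with an exclusive prefix sum over the flag array, along the lines of the compaction suggested earlier. The result depends only on the data, not on scheduling, and if the provisional labels come from pixel indices (as they often do in these schemes) it should also be raster order. A rough, untested sketch with Thrust, assuming the flag array is already on the device:

#include <thrust/device_ptr.h>
#include <thrust/scan.h>

// d_flags      : image-sized array with d_flags[L - 1] = 1 for every label L that occurs
// d_new_labels : image-sized output; d_new_labels[L - 1] becomes the 0-based rank of L
// n            : image_size.x * image_size.y
void build_deterministic_remap(unsigned int* d_flags, unsigned int* d_new_labels, int n)
{
	thrust::device_ptr<unsigned int> flags(d_flags);
	thrust::device_ptr<unsigned int> new_labels(d_new_labels);

	// Exclusive scan: entry L-1 receives the number of flagged entries before
	// it, i.e. how many smaller labels exist. Adding 1 gives the compacted label.
	thrust::exclusive_scan(flags, flags + n, new_labels);
}

// A final pass then replaces every non-zero pixel label L with new_labels[L - 1] + 1.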

cheers