Optimizing Image Labeling Connected Component Labeling

I am attempting to solve the Connected Component Labeling problem with CUDA. The problem is to take a two-dimensional matrix (of ones and zeros) and apply a unique label to each “blob” of ones. For example, the input:

[1 1 0 0

1 1 0 0

0 0 0 0

1 0 1 1

0 0 1 1

0 0 1 1 ]

would produce an output like:

[1 1 0 0

1 1 0 0

0 0 0 0

2 0 3 3

0 0 3 3

0 0 3 3 ]

Here is my strategy (with the time it takes to process a HD image):

    (1.5 miliseconds) Label each non-background pixel in the image with its index

    (20 miliseconds) For each pixel in the image, assign it to the minimum of its neighbors until no more assignments have been made

    (4 miliseconds) Reduce the blob labels to go from 1…n (instead of being based off of large indices) where n is the number of blobs

I am trying to reduce the time it takes to process the middle step, but am stumped on how to do that. My CPU can calculate HD images in about 23 miliseconds, so my current implementation is actually slower!

The CPU loop (kernel code attached in text file) for step 2 is:

while (1)


  // Make each pixel take the minimum label of itself

  //   and its neighbors

  minLabel<<<numBlocks, BLOCK_SIZE>>>(flag, output, numPixelRows, numPixelCols, numPixels);



  // Load the flag into CPU memory

  cudaMemcpy(&localFlag, flag, 1, cudaMemcpyDeviceToHost);


  if (localFlag == 0)


	// The image has not changed

	//   labeling has completed


  } else {

	// The image has changed

	//   labeling is not complete

	localFlag = 0;



  // Reset the flag and sent it back to global GPU memory

  cudaMemcpy(flag, &localFlag,  1, cudaMemcpyHostToDevice);


I’ve attached the kernels as a text file.

Does anyone have ideas of how I can speed up my code?

Should ditch my current strategy and try another?

Is this just a problem that cannot be solved easily with CUDA?
labeling.txt (4.02 KB)

Just had a quick scan:

  • speeding up something that takes not very long on CPU is not always feasible, your code does not look very arithmetic intensive, nor does it need lots of memory. Your dataset probably fits in the CPU cache.

  • imageIdx could be a (2D) texture as far as I can see

  • The following could be much simpler if you do a 2D grid & 2D block. Then you do not need the expensive % and / (integer) operators

    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int i, j;
    unsigned int minLabel;
    unsigned int nPixel, sPixel, ePixel, wPixel;
    unsigned int nwPixel, nePixel, swPixel, sePixel;

    i = idx % numRows; // Current Row
    j = idx / numRows; // Current Column

    unsigned int isNotAtTop = (i != 0);
    unsigned int isNotAtBottom = (i != numRows-1);
    unsigned int isNotAtRight = (j != 0);
    unsigned int isNotAtLeft = (j != numCols-1);
    unsigned int writeFlag;

Your algorithm speed is likely super-data dependent. If you have some long thin squiggly line snaking around, you basically would need to run one iteration for every pixel along the length of the line! That may be your big slowdown.

Since this is a multipass algorithm, I’d try to minimize global memory reads and writes. Break up your image into 32 by 32 pixel regions. Have each block load one of those 32 by 32 tiles into SHARED memory. Now, have each block solve the connected labelling problem for that 32 by 32 subsquare.
This subsquare solution is still a challenge. I’d initialize by taking vertical strips down the pixels, doing sumple initial classifications first… (ie, look at the three pixels above the one you’re working on now and use their minimum non-background label. Otherwise make a new label.) Vertical column processing is MUCH faster since you’d get no shared memory bank conflicts when reading or writing.
After the first intialization pass, run again from top to bottom, but now you look at all your neighbors, not just the top three. Repeat this iteration until you’ve converged.

This iteration still has the “snake” problem of many iterations needed for problem situations, but with a 32 by 32 size it’s less likely to be a big problem.

Now you have each 32 by 32 subsquare solved, but obviously there’s connections between the tiles where regions have different IDs! How to handle those?
You COULD run the same algorithm yet again and offset the tile positions, so you’d create tiles like (16,16) to (48,48), allowing “intercommunication” between the four original tiles that you overlapped, and then a third iteration to cover the final cracks, but then you have the same “snake” problem of a distant tile affecting your final ID.

A completely different (and better) idea may be to find EQUIVALENT IDs, lists of IDs that label the same regions. This might be computed by running down the “seams” of the tiles, seeing which tile regions on one side are connected to the neighbor, and making a BIG LIST of ID pairs, “ID 18171 is equivalent to 27262”. Then, you build a hash table of these renumbering pairs (perhaps on the CPU), following any chains such that the entry for any ID gives a “new” ID of the lowest number found for that group. Then on the GPU a final pass to look up into this hash table to renumber the pixels with their final group number is all you need.
The ugly step is the “run down the seams and make equivanent ID pair lists” but it’s quite reasonable.

OK, that’s it for brainstorming at 2:30 am. It’s certainly something CUDA can do.

If you want to keep your current approach of looking at neighbor pixels and choosing the minimum value:

Copy small rectangular segments of your source image into shared memory. Within each thread block, perform your algorithm on the shared memory segment until no more change is detected. Then copy the result back to global memory. Make sure that reading from and writing to global memory is done in a coalesced way. Never access any pixels across of your current segment boundary.

A thread block should terminate when the segment reports no change during an iteration. The kernel terminates when all segments remained unchanged.

Then in your outer CPU loop you run the kernel again, this time offsetting the block boundaries by half the segment dimensions (your grid may become slightly smaller this time). This means that now you will work across the segment boundaries used by the previous run.

Repeat the entire process until no more change is detected.

I don’t think the overall approach is very efficient. But by using shared memory you will save a lot of memory bandwidth. I’ve used a similar idea in an odd-even sort algorithm and it provided a nice speed up compared to the original implementation with global memory access only.

EDIT: oops, I missed SPWorley’s reply. He explained it very nicely as well.


How about using some parallel algorithms for CCA like the Local Operator method which can label each sub-image in a block and then merge the boundary label in another kernel or in the CPU.

Hey guys, I am going to try this CCA using CUDA. Has somebody already implemented it? I do not want to reinvent the wheel.


has someone in this forum achieved a fast CUDA CCA implementation?


there are some guys out there…



Maybe this will be also interesting:…gTechniques.pdf


I did parallel CCL for my dissertation during my masters. I used a divide and conquer approach. The speed up was somewhere around 5 - 7 times. Search for the paper titled "Parallel Blob Extraction Using the Multi-core Cell Processor ", it has a parallel implementation of CCL on CBE. I did something similar using GPUs. Trying to get a paper out on it.