Weighted Centroids & Connected Component Labeling (CCL) in CUDA

We have an 8-bit greyscale image with many star-like objects. We currently find the weighted centroids in Matlab using:

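    mask = image(:,:) > threshold;            % binary mask of pixels above the threshold
    ccl = bwconncomp(mask);                   % label connected components
    ctroids = regionprops(ccl, image, {'Centroid', 'WeightedCentroid'});
    centroids = cat(1, ctroids.WeightedCentroid);   % N-by-2 array of [x y] per object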
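
To make the target computation explicit, here is a minimal CPU reference sketch in C++ of what the Matlab snippet computes: threshold, label 8-connected components (the `bwconncomp` default for 2-D images), and take the intensity-weighted mean of the pixel coordinates in each component. The function name `weightedCentroids`, the `Centroid` struct, and the row-major `std::vector` layout are assumptions made for illustration. Coordinates are 0-based (Matlab's are 1-based), and components come out in row-major scan order, which may differ from Matlab's ordering. It is not a CUDA implementation, only something a GPU result could be checked against.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical result type: 0-based (x = column, y = row) weighted centroid.
struct Centroid { double x; double y; };

// Hypothetical helper, not part of any library.
// img is a row-major 8-bit image; pixels strictly above `threshold` are foreground.
std::vector<Centroid> weightedCentroids(const std::vector<std::uint8_t>& img,
                                        int width, int height,
                                        std::uint8_t threshold) {
    assert(width >= 0 && height >= 0);
    assert(img.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));

    std::vector<char> visited(img.size(), 0);
    std::vector<Centroid> out;
    std::vector<std::pair<int, int>> stack;  // explicit stack for flood fill

    for (int y0 = 0; y0 < height; ++y0) {
        for (int x0 = 0; x0 < width; ++x0) {
            const std::size_t seed = static_cast<std::size_t>(y0) * width + x0;
            if (visited[seed] || img[seed] <= threshold) continue;

            // New component: accumulate sum(w), sum(w*x), sum(w*y).
            double sumW = 0.0, sumX = 0.0, sumY = 0.0;
            visited[seed] = 1;
            stack.push_back(std::make_pair(x0, y0));

            while (!stack.empty()) {
                const int x = stack.back().first;
                const int y = stack.back().second;
                stack.pop_back();

                const double w = img[static_cast<std::size_t>(y) * width + x];
                sumW += w;
                sumX += w * x;
                sumY += w * y;

                // Visit the 8 neighbours (8-connectivity).
                for (int dy = -1; dy <= 1; ++dy) {
                    for (int dx = -1; dx <= 1; ++dx) {
                        if (dx == 0 && dy == 0) continue;
                        const int nx = x + dx;
                        const int ny = y + dy;
                        if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
                        const std::size_t n = static_cast<std::size_t>(ny) * width + nx;
                        if (visited[n] || img[n] <= threshold) continue;
                        visited[n] = 1;
                        stack.push_back(std::make_pair(nx, ny));
                    }
                }
            }
            // sumW > 0 because every foreground pixel is > threshold >= 0.
            Centroid c = { sumX / sumW, sumY / sumW };
            out.push_back(c);
        }
    }
    return out;
}
```

The per-object sums (total weight, weight times x, weight times y) are the only state needed for each centroid, which is why small compact objects look like they should not require a general-purpose iterative labeling pass.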

We need to stop using Matlab and move to a pure CUDA solution. The image array is already in GPU global memory (from earlier CUDA kernel processing steps).

The issue I am having with the code and papers I find on the net is that they do things that are not optimal for our problem, which has small, compact objects (no holes, no long pipes). Here are some of the places I have looked:

  1. GPU Computing Gems (http://hpcg.purdue.edu/bbenes/papers/Stava2011CCL.pdf). No centroiding, only CCL. Only handles square images. The pseudocode is unclear, and I really don’t like the iterative kernel invocation. I think they need it because of large objects and odd shapes. I suspect there are better algorithms for what we need.

  2. ArrayFire (http://www.arrayfire.com/docs/group__image__func__cpp__centroids.htm). I like the simplicity, but there is no weighted centroid option. I have no idea what the performance is. I would love to see an explanation of the algorithm. Also, they seem to assume host-side C++ arrays, and our data is already on the GPU. I don’t want to copy it back and forth.

  3. OpenCV. I can’t find the correct routines, and I don’t need the full GPU library, just the CCL and centroids. This library also seems to expect the data to start on the host and to be returned to the host. Can I just get a snippet of the source code?