Is there a faster algorithm or implementation of a VoxelGrid Filter with Thrust?

Hi guys, I am quite new to CUDA programming, so bear with me; any comments on improving my code style or approach are appreciated.

I am currently trying to implement a VoxelGrid filter for point clouds using CUDA/Thrust. So far I have followed the algorithm of the PCL VoxelGrid filter, i.e. the following steps:

  1. Calculate a voxel-grid index for each point.
  2. Sort the points and indices by those indices.
  3. Count the number of unique indices.
  4. Count the number of points per index.
  5. Sum the XYZ and squared RGB values of the points with the same index.
  6. Divide the accumulated values by the weights and take the square root of the RGB values.

Sadly, this does not perform as well as hoped, so I hope the community has suggestions for speeding it up further.

#include <cstdint>
#include <cmath>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/inner_product.h>
#include <thrust/copy.h>
#include <thrust/functional.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/execution_policy.h>

struct PointDiv {
    __host__ __device__
    Point operator()(const uint32_t& w, const Voxel& v) const {
        Point p;
        p.x = v.x / w;
        p.y = v.y / w;
        p.z = v.z / w;

        // the Voxel accumulators hold squared RGB sums, so take the
        // mean first and then the square root of each channel
        uint8_t* rgb = (uint8_t*) &p.rgb;
        rgb[2] = (uint8_t) roundf(sqrtf(v.r / w));
        rgb[1] = (uint8_t) roundf(sqrtf(v.g / w));
        rgb[0] = (uint8_t) roundf(sqrtf(v.b / w));

        return p;
    }
};


// note: take the cloud by reference (passing a thrust::device_vector by value
// copies the whole cloud), and the output buffer holds Point, not float
uint32_t voxel_grid(uint32_t num, thrust::device_vector<Voxel>& d_voxel_cloud, Point* out) {
    thrust::device_vector<uint32_t> voxel_idxs(num);

    // Step 1: Produce indices
    thrust::transform(d_voxel_cloud.begin(), d_voxel_cloud.end(), voxel_idxs.begin(), PointToKey());
    
    // Step 2: Sort by Idxs
    thrust::sort_by_key(voxel_idxs.begin(), voxel_idxs.end(), d_voxel_cloud.begin());

    // Step 3: Count number of voxels
    // number of histogram bins equals the number of unique keys (assumes num > 0)
    uint32_t num_voxels = thrust::inner_product(voxel_idxs.begin(), voxel_idxs.end() - 1,
                                                voxel_idxs.begin() + 1, 1,
                                                thrust::plus<uint32_t>(),
                                                thrust::not_equal_to<uint32_t>());

    thrust::device_vector<uint32_t> d_weights(num_voxels);
    thrust::device_vector<uint32_t> d_idx_reduced(num_voxels);

    // Step 4: Produce "Histogram" for weights
    thrust::reduce_by_key(voxel_idxs.begin(), voxel_idxs.end(),
                          thrust::constant_iterator<uint32_t>(1),
                          d_idx_reduced.begin(), d_weights.begin());

    // Step 5: Merge all values with the same idx (keep the sums as Voxel;
    // they are only converted to Point after the division in step 6)
    thrust::device_vector<uint32_t> d_idx_after_vox(num_voxels);
    thrust::device_vector<Voxel> d_voxel_reduced(num_voxels);
    thrust::reduce_by_key(thrust::device, voxel_idxs.begin(), voxel_idxs.end(),
                          d_voxel_cloud.begin(), d_idx_after_vox.begin(),
                          d_voxel_reduced.begin(), thrust::equal_to<uint32_t>(),
                          thrust::plus<Voxel>());

    // Step 6: Divide by weight and convert to output Points
    thrust::device_vector<Point> d_point_cloud_out(num_voxels);
    thrust::transform(d_weights.begin(), d_weights.end(), d_voxel_reduced.begin(),
                      d_point_cloud_out.begin(), PointDiv());

    // with device iterators as the source, Thrust treats the raw output
    // pointer as a device pointer, so `out` must point to device memory
    thrust::copy(d_point_cloud_out.begin(), d_point_cloud_out.end(), out);

    return num_voxels;
}
  1. You might get better help if you provided complete code that someone else could compile and run to measure performance. Note that I’m not suggesting you provide your whole code; the ideal here would simply be a test framework around what you have shown: provide the missing definitions (Point, Voxel, PointToKey, etc.) and a simple unit test that sets up some data, calls your voxel_grid routine, and takes a timing measurement for a data-set size that is relevant for your usage. With performance questions I usually also recommend supplying the GPU you are running on, the compile command line, the CUDA version and OS, and your performance target. A sketch of such a harness follows.
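     For illustration, here is a minimal sketch of such a harness. Point, Voxel, PointToKey, the leaf size, and the key packing are all assumptions, since the originals were not shown; paste these definitions above the PointDiv/voxel_grid code and compile everything as one .cu file.

     #include <cstdio>
     #include <cstdint>
     #include <vector>
     #include <random>
     #include <thrust/device_vector.h>

     // hypothetical stand-ins for the missing types -- NOT the OP's real definitions
     struct Point { float x, y, z, rgb; };      // rgb packed into a float, PCL-style

     struct Voxel {
         float x, y, z, r, g, b;                // r, g, b hold squared channel sums
         __host__ __device__ Voxel operator+(const Voxel& o) const {
             return Voxel{x + o.x, y + o.y, z + o.z, r + o.r, g + o.g, b + o.b};
         }
     };

     // hypothetical key functor: quantize into a 1024x1024x1024 grid with 5 cm leaves
     struct PointToKey {
         __host__ __device__ uint32_t operator()(const Voxel& v) const {
             const float inv_leaf = 1.0f / 0.05f;
             uint32_t ix = (uint32_t)(v.x * inv_leaf) & 1023;
             uint32_t iy = (uint32_t)(v.y * inv_leaf) & 1023;
             uint32_t iz = (uint32_t)(v.z * inv_leaf) & 1023;
             return (iz << 20) | (iy << 10) | ix;
         }
     };

     int main() {
         const uint32_t num = 1 << 20;          // ~1M points, adjust to your use case
         std::mt19937 rng(42);
         std::uniform_real_distribution<float> dist(0.0f, 10.0f);

         std::vector<Voxel> h_cloud(num);
         for (auto& v : h_cloud) {
             float c = dist(rng) * 25.5f;       // fake color channel in [0, 255]
             v = Voxel{dist(rng), dist(rng), dist(rng), c * c, c * c, c * c};
         }

         thrust::device_vector<Voxel> d_cloud = h_cloud;
         thrust::device_vector<Point> d_out(num);   // device-side output buffer

         cudaEvent_t start, stop;
         cudaEventCreate(&start);
         cudaEventCreate(&stop);
         cudaEventRecord(start);
         uint32_t n = voxel_grid(num, d_cloud, thrust::raw_pointer_cast(d_out.data()));
         cudaEventRecord(stop);
         cudaEventSynchronize(stop);
         float ms = 0.0f;
         cudaEventElapsedTime(&ms, start, stop);
         printf("%u voxels in %.3f ms\n", n, ms);
         return 0;
     }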

  2. Have you done any profiling (e.g. with Nsight Systems, or simple event-based timing as sketched below) to see which Thrust operations are using the most time? My guess is that it is the sort.
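     If a full profiler is not at hand, a cheap way to attribute time is to bracket each call with CUDA events, e.g. around the sort (a sketch, assuming the variables from the code above):

     cudaEvent_t start, stop;
     cudaEventCreate(&start);
     cudaEventCreate(&stop);

     cudaEventRecord(start);
     thrust::sort_by_key(voxel_idxs.begin(), voxel_idxs.end(), d_voxel_cloud.begin());
     cudaEventRecord(stop);
     cudaEventSynchronize(stop);       // wait for the sort to finish on the device

     float ms = 0.0f;
     cudaEventElapsedTime(&ms, start, stop);
     printf("sort_by_key: %.3f ms\n", ms);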

  3. Step 3 seems mostly unnecessary, although I doubt it is a major contributor to the overall time. Can’t you get the size from the reduce-by-key in step 4? It returns iterators to the end of its output ranges; see the sketch below.
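     Concretely, thrust::reduce_by_key returns a pair of iterators pointing one past the last key and value it wrote. If the output vectors are sized to the worst case (num, i.e. every point in its own voxel), the unique count falls out of the return value and the inner_product pass can be dropped (a sketch, assuming the variables from the code above):

     // size to the worst case instead of pre-counting with inner_product
     thrust::device_vector<uint32_t> d_idx_reduced(num);
     thrust::device_vector<uint32_t> d_weights(num);

     auto ends = thrust::reduce_by_key(voxel_idxs.begin(), voxel_idxs.end(),
                                       thrust::constant_iterator<uint32_t>(1),
                                       d_idx_reduced.begin(), d_weights.begin());

     // number of voxels = number of unique keys actually written
     uint32_t num_voxels = (uint32_t)(ends.first - d_idx_reduced.begin());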

  4. Your Point and Voxel appear to be structures, so you seem to be using an AoS data organization scheme. Thrust generally works better with SoA, although I don’t know how much it would matter here. Thrust sort operations that are encumbered by moving big structs around during the sorting can sometimes be improved by sorting on just the relevant item, creating a rearrangement index, and then doing a permuted copy to rearrange the items after the sort. See here; a sketch follows below.
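     A sketch of that idea, assuming the variables from the code above: sort only the 4-byte keys together with a 4-byte index, then move each heavy Voxel exactly once.

     #include <thrust/sequence.h>
     #include <thrust/gather.h>

     // 0, 1, 2, ...: the rearrangement index
     thrust::device_vector<uint32_t> order(num);
     thrust::sequence(order.begin(), order.end());

     // sort keys plus the small index instead of dragging whole Voxels around
     thrust::sort_by_key(voxel_idxs.begin(), voxel_idxs.end(), order.begin());

     // one gather pass permutes the heavy payload into sorted order;
     // use d_sorted in place of d_voxel_cloud in the later steps
     thrust::device_vector<Voxel> d_sorted(num);
     thrust::gather(order.begin(), order.end(), d_voxel_cloud.begin(), d_sorted.begin());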