Is there a faster algorithm or implementation of a VoxelGrid Filter with Thrust?

Hi guys, I am quite new to CUDA programming, so bear with me; any comments on improving my code style or approach are appreciated.

I am currently trying to implement a VoxelGrid filter for point clouds using CUDA/Thrust. So far I have followed the algorithm of the PCL VoxelGrid filter, i.e. the following steps:

  1. Calculate a voxel-grid index for each point.
  2. Sort the points and indices by those indices.
  3. Count the number of unique indices.
  4. Count the number of points per index.
  5. Sum the XYZ and squared RGB values of the points with the same index.
  6. Divide the accumulated values by the weights and take the square root of the RGB values.

Sadly, this does not perform as well as hoped, so I hope the community has suggestions for speeding it up further.

#include <cstdint>
#include <cmath>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/inner_product.h>
#include <thrust/copy.h>
#include <thrust/functional.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/execution_policy.h>

struct PointDiv {
    __host__ __device__
    Point operator()(const uint32_t& w, const Voxel& v) const {
        Point p;
        p.x = v.x / w;
        p.y = v.y / w;
        p.z = v.z / w;

        // the Voxel accumulators hold squared RGB sums, so take the
        // mean first and then the square root of each channel
        uint8_t* rgb = (uint8_t*) &p.rgb;
        rgb[2] = (uint8_t) roundf(sqrtf(v.r / w));
        rgb[1] = (uint8_t) roundf(sqrtf(v.g / w));
        rgb[0] = (uint8_t) roundf(sqrtf(v.b / w));

        return p;
    }
};


// note: take the cloud by reference (passing a thrust::device_vector by value
// copies the whole cloud), and the output buffer holds Point, not float
uint32_t voxel_grid(uint32_t num, thrust::device_vector<Voxel>& d_voxel_cloud, Point* out) {
    thrust::device_vector<uint32_t> voxel_idxs(num);

    // Step 1: Produce indices
    thrust::transform(d_voxel_cloud.begin(), d_voxel_cloud.end(), voxel_idxs.begin(), PointToKey());
    
    // Step 2: Sort by Idxs
    thrust::sort_by_key(voxel_idxs.begin(), voxel_idxs.end(), d_voxel_cloud.begin());

    // Step 3: Count number of voxels
    // number of histogram bins equals the number of unique keys (assumes num > 0)
    uint32_t num_voxels = thrust::inner_product(voxel_idxs.begin(), voxel_idxs.end() - 1,
                                                voxel_idxs.begin() + 1, 1,
                                                thrust::plus<uint32_t>(),
                                                thrust::not_equal_to<uint32_t>());

    thrust::device_vector<uint32_t> d_weights(num_voxels);
    thrust::device_vector<uint32_t> d_idx_reduced(num_voxels);

    // Step 4: Produce "Histogram" for weights
    thrust::reduce_by_key(voxel_idxs.begin(), voxel_idxs.end(),
                          thrust::constant_iterator<uint32_t>(1),
                          d_idx_reduced.begin(), d_weights.begin());

    // Step 5: Merge all values with the same idx (keep the sums as Voxel;
    // they are only converted to Point after the division in step 6)
    thrust::device_vector<uint32_t> d_idx_after_vox(num_voxels);
    thrust::device_vector<Voxel> d_voxel_reduced(num_voxels);
    thrust::reduce_by_key(thrust::device, voxel_idxs.begin(), voxel_idxs.end(),
                          d_voxel_cloud.begin(), d_idx_after_vox.begin(),
                          d_voxel_reduced.begin(), thrust::equal_to<uint32_t>(),
                          thrust::plus<Voxel>());

    // Step 6: Divide by weight and convert to output Points
    thrust::device_vector<Point> d_point_cloud_out(num_voxels);
    thrust::transform(d_weights.begin(), d_weights.end(), d_voxel_reduced.begin(),
                      d_point_cloud_out.begin(), PointDiv());

    // with device iterators as the source, Thrust treats the raw output
    // pointer as a device pointer, so `out` must point to device memory
    thrust::copy(d_point_cloud_out.begin(), d_point_cloud_out.end(), out);

    return num_voxels;
}
  1. You might get better help if you provided complete code that someone else could compile and run to measure performance. Note that I’m not suggesting you provide your whole code; the ideal here would simply be a test framework around what you have shown: provide the missing definitions (Point, Voxel, PointToKey, etc.) and a simple unit test that sets up some data, calls your voxel_grid routine, and takes a timing measurement for a data-set size that is relevant for your usage. With performance questions I usually also recommend supplying the GPU you are running on, the compile command line, the CUDA version and OS, and your performance target. A sketch of such a harness follows.
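     For illustration, here is a minimal sketch of such a harness. Point, Voxel, PointToKey, the leaf size, and the key packing are all assumptions, since the originals were not shown; paste these definitions above the PointDiv/voxel_grid code and compile everything as one .cu file.

     #include <cstdio>
     #include <cstdint>
     #include <vector>
     #include <random>
     #include <thrust/device_vector.h>

     // hypothetical stand-ins for the missing types -- NOT the OP's real definitions
     struct Point { float x, y, z, rgb; };      // rgb packed into a float, PCL-style

     struct Voxel {
         float x, y, z, r, g, b;                // r, g, b hold squared channel sums
         __host__ __device__ Voxel operator+(const Voxel& o) const {
             return Voxel{x + o.x, y + o.y, z + o.z, r + o.r, g + o.g, b + o.b};
         }
     };

     // hypothetical key functor: quantize into a 1024x1024x1024 grid with 5 cm leaves
     struct PointToKey {
         __host__ __device__ uint32_t operator()(const Voxel& v) const {
             const float inv_leaf = 1.0f / 0.05f;
             uint32_t ix = (uint32_t)(v.x * inv_leaf) & 1023;
             uint32_t iy = (uint32_t)(v.y * inv_leaf) & 1023;
             uint32_t iz = (uint32_t)(v.z * inv_leaf) & 1023;
             return (iz << 20) | (iy << 10) | ix;
         }
     };

     int main() {
         const uint32_t num = 1 << 20;          // ~1M points, adjust to your use case
         std::mt19937 rng(42);
         std::uniform_real_distribution<float> dist(0.0f, 10.0f);

         std::vector<Voxel> h_cloud(num);
         for (auto& v : h_cloud) {
             float c = dist(rng) * 25.5f;       // fake color channel in [0, 255]
             v = Voxel{dist(rng), dist(rng), dist(rng), c * c, c * c, c * c};
         }

         thrust::device_vector<Voxel> d_cloud = h_cloud;
         thrust::device_vector<Point> d_out(num);   // device-side output buffer

         cudaEvent_t start, stop;
         cudaEventCreate(&start);
         cudaEventCreate(&stop);
         cudaEventRecord(start);
         uint32_t n = voxel_grid(num, d_cloud, thrust::raw_pointer_cast(d_out.data()));
         cudaEventRecord(stop);
         cudaEventSynchronize(stop);
         float ms = 0.0f;
         cudaEventElapsedTime(&ms, start, stop);
         printf("%u voxels in %.3f ms\n", n, ms);
         return 0;
     }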

  2. Have you done any profiling (e.g. with Nsight Systems, or simple event-based timing as sketched below) to see which Thrust operations are using the most time? My guess is that it is the sort.
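     If a full profiler is not at hand, a cheap way to attribute time is to bracket each call with CUDA events, e.g. around the sort (a sketch, assuming the variables from the code above):

     cudaEvent_t start, stop;
     cudaEventCreate(&start);
     cudaEventCreate(&stop);

     cudaEventRecord(start);
     thrust::sort_by_key(voxel_idxs.begin(), voxel_idxs.end(), d_voxel_cloud.begin());
     cudaEventRecord(stop);
     cudaEventSynchronize(stop);       // wait for the sort to finish on the device

     float ms = 0.0f;
     cudaEventElapsedTime(&ms, start, stop);
     printf("sort_by_key: %.3f ms\n", ms);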

  3. Step 3 seems mostly unnecessary, although I doubt it is a major contributor to the overall time. Can’t you get the size from the reduce-by-key in step 4? It returns iterators to the end of its output ranges; see the sketch below.
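     Concretely, thrust::reduce_by_key returns a pair of iterators pointing one past the last key and value it wrote. If the output vectors are sized to the worst case (num, i.e. every point in its own voxel), the unique count falls out of the return value and the inner_product pass can be dropped (a sketch, assuming the variables from the code above):

     // size to the worst case instead of pre-counting with inner_product
     thrust::device_vector<uint32_t> d_idx_reduced(num);
     thrust::device_vector<uint32_t> d_weights(num);

     auto ends = thrust::reduce_by_key(voxel_idxs.begin(), voxel_idxs.end(),
                                       thrust::constant_iterator<uint32_t>(1),
                                       d_idx_reduced.begin(), d_weights.begin());

     // number of voxels = number of unique keys actually written
     uint32_t num_voxels = (uint32_t)(ends.first - d_idx_reduced.begin());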

  4. Your Point and Voxel appear to be structures, so you seem to be using an AoS data organization scheme. Thrust generally works better with SoA, although I don’t know how much it would matter here. Thrust sort operations that are encumbered by moving big structs around during the sorting can sometimes be improved by sorting on just the relevant item, creating a rearrangement index, and then doing a permuted copy to rearrange the items after the sort. See here; a sketch follows below.
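     A sketch of that idea, assuming the variables from the code above: sort only the 4-byte keys together with a 4-byte index, then move each heavy Voxel exactly once.

     #include <thrust/sequence.h>
     #include <thrust/gather.h>

     // 0, 1, 2, ...: the rearrangement index
     thrust::device_vector<uint32_t> order(num);
     thrust::sequence(order.begin(), order.end());

     // sort keys plus the small index instead of dragging whole Voxels around
     thrust::sort_by_key(voxel_idxs.begin(), voxel_idxs.end(), order.begin());

     // one gather pass permutes the heavy payload into sorted order;
     // use d_sorted in place of d_voxel_cloud in the later steps
     thrust::device_vector<Voxel> d_sorted(num);
     thrust::gather(order.begin(), order.end(), d_voxel_cloud.begin(), d_sorted.begin());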