cuPCL on ROS2

Hi NVIDIA Community,

I am working on integrating NVIDIA’s cuPCL library, specifically the cuCluster component, into my ROS2 node. While the CUDA implementation offers potential performance benefits, I’ve encountered a challenge with memory management.

Currently, calling the cuCluster function uses significantly more memory than the standard PCL clustering implementation our ROS2 node previously used. I've tried to reduce this overhead by pre-allocating the input and output buffers. However, the number of points in each incoming cloud varies from frame to frame, so it is hard to choose buffer sizes that are large enough without wasting memory.
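One common form of pre-allocation for variable-sized input is a grow-only buffer. It reallocates only when a frame needs more room than the current capacity, and it grows geometrically so that reallocations become rare. The C++ sketch below illustrates that general pattern under stated assumptions. It is not cuPCL code and does not use any cuPCL API. `GrowOnlyBuffer`, `host_alloc`, and `host_free` are hypothetical names. The host allocator stands in for device calls such as `cudaMalloc`/`cudaFree`, so the sketch runs without a GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical allocator hooks. In a real node these would wrap CUDA calls
// (e.g. cudaMalloc/cudaFree); host malloc/free stand in here.
using AllocFn = void* (*)(std::size_t bytes);
using FreeFn = void (*)(void* ptr);

static void* host_alloc(std::size_t bytes) { return std::malloc(bytes); }
static void host_free(void* ptr) { std::free(ptr); }

// Grow-only buffer: reallocates only when a frame needs more elements than
// the current capacity, so frames that fit cause no allocation at all.
template <typename T>
class GrowOnlyBuffer {
 public:
  GrowOnlyBuffer(AllocFn alloc, FreeFn release) : alloc_(alloc), release_(release) {}
  ~GrowOnlyBuffer() {
    if (data_ != nullptr) release_(data_);
  }
  GrowOnlyBuffer(const GrowOnlyBuffer&) = delete;
  GrowOnlyBuffer& operator=(const GrowOnlyBuffer&) = delete;

  // Ensure room for n elements. Growth is at least 2x to amortize
  // reallocations. Old contents are NOT preserved; each frame rewrites them.
  T* reserve(std::size_t n) {
    if (n > capacity_) {
      std::size_t new_cap = std::max(n, capacity_ * 2);
      void* p = alloc_(new_cap * sizeof(T));
      if (p == nullptr) throw std::bad_alloc();
      if (data_ != nullptr) release_(data_);
      data_ = static_cast<T*>(p);
      capacity_ = new_cap;
      ++allocations_;
    }
    assert(n <= capacity_);  // invariant: the buffer now fits the frame
    return data_;
  }

  T* data() const { return data_; }
  std::size_t capacity() const { return capacity_; }
  std::size_t allocations() const { return allocations_; }

 private:
  AllocFn alloc_;
  FreeFn release_;
  T* data_ = nullptr;
  std::size_t capacity_ = 0;
  std::size_t allocations_ = 0;
};
```

The trade-off is that capacity tracks the largest cloud seen so far, so one unusually large frame keeps that memory reserved. Capping the input size or periodically shrinking the buffer are possible mitigations.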

Specific questions:

  1. Is my approach of pre-allocating buffers a sound methodology for this use case?
  2. What strategies would you recommend for managing dynamic memory allocation while maintaining CUDA acceleration benefits?
  3. Are there best practices for reducing memory overhead in this specific scenario?

Given your expertise with cuPCL, I would greatly appreciate your insights on optimizing this implementation for a ROS2 environment.

Thank you for your time and consideration.

Best regards,
HK