Accelerating Lidar for Robotics with NVIDIA CUDA-based PCL

Originally published at: Accelerating Lidar for Robotics with NVIDIA CUDA-based PCL | NVIDIA Developer Blog

Many Jetson users choose lidars as their primary sensors for localization and perception in autonomous solutions. Lidars describe the spatial environment around the vehicle as a collection of three-dimensional points known as a point cloud. Point clouds sample the surfaces of surrounding objects at long range and with high precision, which makes them well-suited for use…

Hi, jwitsoe!
I am trying to use CUDA-PCL on a Jetson TX2, but I have run into a CUDA failure that I cannot resolve. Any suggestions would be appreciated.

Environment: Jetson TX2 with Jetpack 4.5 (Ubuntu 18.04, CUDA-10.2, PCL 1.8.1)
Problem: When I run the built demo in any subfolder, it fails with a CUDA error. Below is an example from running the demo in cuda-pcl/cuda-segmentation:

nvidia@nvidia-tx2:~/Downloads/cuda-pcl/cuda-segmentation$ ./demo sample.pcd

GPU has cuda devices: 1
----device id: 0 info----
GPU : NVIDIA Tegra X2
Capbility: 6.2
Global memory: 7850MB
Const memory: 64KB
SM in a block: 48KB
warp size: 32
threads in a block: 1024
block dim: (1024,1024,64)
grid dim: (2147483647,65535,65535)


Cuda failure: no kernel image is available for execution on the device at line 310 in file cudaSegmentation.cpp error status: 209
Aborted (core dumped)

I have made several attempts to solve it.
First, I downgraded to Jetpack 4.4.1, which matches the official test environment, but that did not help.
Next, I followed the solutions to similar problems: I manually added 62 (the compute capability 6.2 of the Jetson TX2) to the SMS variable in the Makefile. Nothing changed.
Since the source code is not available, I can’t do more with it.

I don’t know much about CUDA programming, but I guess the .so file was not compiled with SMS=62, so it cannot run on the Jetson TX2. I would appreciate it if you could fix this for us TX2 users.

Hi triokun,
You are right: the error below means there is no kernel image for the current device.
This is because CUDA-PCL was not compiled for SM62.
Cuda failure: no kernel image is available for execution on the device at line 310 in file cudaSegmentation.cpp error status: 209
Aborted (core dumped)
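
As a rough illustration of why adding 62 to SMS did not help: assuming the Makefile follows the CUDA samples convention, each SMS entry becomes one nvcc -gencode flag, and those flags only affect code compiled from source. A prebuilt .so keeps whatever kernel images it was built with. The sketch below is host-only C++; gencodeFlags and hasNativeKernel are hypothetical helper names, and ignoring a possible PTX JIT fallback is a simplifying assumption.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Build nvcc -gencode flags from a list of SM versions, in the style of the
// SMS variable in the CUDA sample Makefiles (assumed convention).
std::string gencodeFlags(const std::vector<int>& sms) {
    std::ostringstream out;
    for (std::size_t i = 0; i < sms.size(); ++i) {
        assert(sms[i] > 0);  // SM versions such as 53 or 62 are positive
        if (i > 0) out << ' ';
        out << "-gencode arch=compute_" << sms[i] << ",code=sm_" << sms[i];
    }
    return out.str();
}

// True if a binary built for `sms` carries native code for `capability`
// (62 for the TX2). Hypothetical helper; it ignores PTX JIT fallback.
bool hasNativeKernel(const std::vector<int>& sms, int capability) {
    for (int sm : sms) {
        if (sm == capability) return true;
    }
    return false;
}
```

For example, gencodeFlags({62}) yields -gencode arch=compute_62,code=sm_62, which is the flag the library itself would need to be rebuilt with. With the SMS list printed later in this thread (30 35 50 53 60 61 70 72), hasNativeKernel(..., 62) is false; whether the prebuilt library used that same list is an assumption.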

Hi, leif!
Thanks for your answer.
I’m wondering whether you could recompile the library for TX2, if you have the source code. It would help me a lot.

Here is the library for TX2. It has not been tested because we have no TX2 locally.

I’m grateful for your help. I have tested it on TX2 and it worked perfectly!
Would you mind recompiling the other two libraries (libcudafilter.so and libcudaicp.so) for TX2?
Again, thank you so much!

Please check the two libs.

Hi @leif ,

I tried building the CUDA-ICP example and got this error: /usr/bin/ld: ./lib/libcudaicp.so: error adding symbols: file in wrong format

Environment: GTX 1050 (Ubuntu 20.04, CUDA-10.2, PCL 1.10.1)

USE Default CUDA DIR: /usr/local/cuda
TARGET_ARCH: x86_64
CUDA_VERSION: 11000
SMS: 30 35 50 53 60 61 70 72 
g++ -D_REENTRANT -std=c++11 -std=c++14 -O2 -o demo obj/main.o  -L/usr/lib -L/usr/local/lib -L/usr/local/cuda/lib64 -lcudart_static -lrt -ldl -lpthread -lcudart -L/lib64 -lcudnn -lpthread -L/usr/lib/aarch64-linux-gnu/ -lboost_system -lpcl_common -lpcl_io -lpcl_recognition -lpcl_features -lpcl_sample_consensus -lpcl_octree -lpcl_search -lpcl_filters -lpcl_kdtree -lpcl_segmentation -lpcl_visualization ./lib/libcudaicp.so
/usr/bin/ld: ./lib/libcudaicp.so: error adding symbols: file in wrong format
collect2: error: ld returned 1 exit status
make: *** [Makefile:173: demo] Error 1

How do I get past that?

Looks like the libraries are compiled for ARM processors. Can you recompile them (or provide the source code) for x86_64?
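
One way to confirm this guess is to read the e_machine field of the library's ELF header, which is what the linker rejects when it reports "file in wrong format". The sketch below is a minimal host-only C++ version of that check; elfMachine and elfMachineOfFile are hypothetical names, and only the two architectures discussed in this thread are recognized.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// e_machine values from the ELF specification.
constexpr std::uint16_t kEmX86_64 = 62;
constexpr std::uint16_t kEmAArch64 = 183;

// Read a 16-bit field at `offset`, honoring the file's byte order.
std::uint16_t readU16(const std::vector<unsigned char>& bytes,
                      std::size_t offset, bool littleEndian) {
    assert(offset + 1 < bytes.size());
    const unsigned lo = littleEndian ? bytes[offset] : bytes[offset + 1];
    const unsigned hi = littleEndian ? bytes[offset + 1] : bytes[offset];
    return static_cast<std::uint16_t>(lo | (hi << 8));
}

// Classify an ELF image from its first 20 bytes.
std::string elfMachine(const std::vector<unsigned char>& header) {
    if (header.size() < 20) return "too short";
    if (header[0] != 0x7f || header[1] != 'E' ||
        header[2] != 'L' || header[3] != 'F') {
        return "not ELF";
    }
    const bool littleEndian = (header[5] == 1);  // EI_DATA: 1 = LSB, 2 = MSB
    const std::uint16_t machine = readU16(header, 18, littleEndian);
    if (machine == kEmX86_64) return "x86_64";
    if (machine == kEmAArch64) return "aarch64";
    return "other";
}

// Read the first 20 bytes of a file and classify it.
// An unreadable file yields "too short".
std::string elfMachineOfFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> header(20);
    in.read(reinterpret_cast<char*>(header.data()),
            static_cast<std::streamsize>(header.size()));
    header.resize(static_cast<std::size_t>(in.gcount()));
    return elfMachine(header);
}
```

Calling elfMachineOfFile("./lib/libcudaicp.so") on the x86_64 host would be expected to return "aarch64" if the library was indeed built for Jetson, which would explain the linker error.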

Yes, they all work well on TX2 except for CUDA_VoxelGrid. Here is the output:

---------------checking CUDA VoxelGrid---------------------
ERROR case
status = 11

Jetson modules have a known GPU type, but PCs do not.
It is hard to adapt cuda-pcl to every GPU.
We may support x86_64 later.

VoxelGrid may not be suitable for TX2.
We will check it later.

Hi @leif
When I run the cuda-icp example, I get this output:

$~/cuda-pcl-main/cuda-icp$ ./demo
GPU has cuda devices: 1
----device id: 0 info----
GPU : NVIDIA Tegra X1
Capbility: 5.3
Global memory: 3956MB
Const memory: 64KB
SM in a block: 48KB
warp size: 32
threads in a block: 1024
block dim: (1024,1024,64)
grid dim: (2147483647,65535,65535)

Loaded 859059 data points for P with the following fields: x y z rgb
Loaded 784546 data points for Q with the following fields: x y z rgb
iter.Maxiterate 20
iter.threshold 1e-12
iter.acceptrate 1

Target rigid transformation : cloud_in → cloud_icp
Rotation matrix :
| 0.923880 -0.382683 0.000000 |
R = | 0.382683 0.923880 0.000000 |
| 0.000000 0.000000 1.000000 |
Translation vector :
t = < 0.000000, 0.000000, 0.200000 >

matrix_icp native value
Rotation matrix :
| 1.000000 0.000000 0.000000 |
R = | 0.000000 1.000000 0.000000 |
| 0.000000 0.000000 1.000000 |
Translation vector :
t = < 0.000000, 0.000000, 0.000000 >

------------checking CUDA ICP(GPU)----------------
Cuda failure: the launch timed out and was terminated at line 59 in file cudaICP.cpp error status: 702
Aborted (core dumped)

Can you help me fix this problem?
Thank you.

Hi @nghiaphamsg
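Error status 702 is cudaErrorLaunchTimeout: the kernel ran longer than the device's run time limit and was terminated.
Whether a device enforces a run time limit on kernels is reported by the kernelExecTimeoutEnabled field of cudaDeviceProp:
https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html#structcudaDeviceProp_19a63114766c4d2309f00403c1bf056c8
Could you try boosting your device first?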
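
For reference, the two numeric statuses quoted in this thread map to CUDA runtime error names as sketched below. In a real CUDA program cudaGetErrorName does this lookup; the host-only table here is only an illustration, cudaStatusName is a hypothetical name, and the numbering is the one printed by the CUDA 10.2 runtime in the logs above.

```cpp
#include <cassert>
#include <string>

// Names for the two CUDA runtime status codes seen in this thread.
// Any other code falls through to a placeholder string.
std::string cudaStatusName(int status) {
    assert(status >= 0);  // cudaError_t values are non-negative
    switch (status) {
        case 209: return "cudaErrorNoKernelImageForDevice";  // binary lacks code for this GPU
        case 702: return "cudaErrorLaunchTimeout";           // kernel exceeded the run time limit
        default:  return "unknown (not covered by this sketch)";
    }
}
```

So status 209 points at how the library was built, while status 702 points at how long a kernel ran on the device.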

Hi jwitsoe,
I am processing 3D points in a custom way to create 2D images. I iterate over all the points and want to speed up the process using CUDA. The code is part of a ROS node. Could you please let me know whether you have any samples or tutorials for ROS and CUDA, especially for point cloud processing?
Thanks in advance,
Ahmet