Nearest Neighbors ICP

One of the key issues of ICP is to find the closest neighbors of a reference point set in 3D to another point set.

I am referring to the brute force search and the point set can consist of millions of points.

The first stage is to find the distances of the reference set R to the target set T. So you keep an array of distances DR for the reference set and an index set IR (not mentioned left as an exercise). Note that you do not need to keep an array of distances for the reference set for the reason explained below.
You divide the Target set into chunks of constants or texture memory structures CT of point sets to get advantage of caching due to non coalescing issues.

Now for each chunk and all the reference point sets you proceed as follows, note first that the reading will be coalesced in the reference set
in the host
for each chunk

Inside the kernel:

for each thread
i = blockIdx.x*blockDim.x + threadIdx.x;
set RP = R[i]; //coalesced
min = DR[i]; //coalesced
for (j = 0; j < CHUNK_SIZE ; j++){
[indent]if (Dist < min)[/indent]
[indent][indent]min = Dist;[/indent][/indent]
DR[i] = min; //coalesced

Any impovements you can suggest are welcome. I think that this is a nice example of seeing the difference between GF100 and GT200. A good discussion will be how the Fermi architecture is going to react on the different stages of the algorithm. Any optimization ideas for the Fermi architecture are very welcome. I have now dived into the world of Fermi and I am a good learner. :-)

I have no experience on this topic myself, but it just reminded me of this post:
Fast k nearest neighbor search using GPU: code now available.

Exactly it is the paper that I used only there is no need for the sorting step. :-)

True, but this only works if you are doing 1-nearest neighbor search. If you are interested in finding the k-nearest neighbors (k>1), you can avoid storing the distance array for small values of k using a sort-merge technique. I talked about this a bit here:

Code release: Random Ball Cover (RBC) for fast nearest neighbor

You can also have a look at the code if interested…


I was not aware of this work. Thanks for the pointer. I am going to do some tests with your code also. I need a k-nearest neighbor not for this phase but on the phase finding the normals of a 3D point set where there is no structural information, just the point set. Thanks! Also I will write you my experience using it with Mac OS X using a GTX 480 card. I made this beast working with Mac and I am just thrilled with its performance so far. :-)



I have some experience regarding nearest neighbor search and the ICP algorithm,

and I think you should really reconsider the use of a brute force approach for your problem.

Let’s compare your configuration with the configuration in the “Fast k Nearest Neighbor Search using GPU” paper.

Your configuration:

    dimension of point sets: d = 3

    number of points: n > 1000000

Configuration in the paper:

    dimension of point sets: d >= 8

    number of points: n <= 38400

Lower dimensionality and higher number of points both favor

more sophisticated algorithms like the k-D tree algorithm.

I might even go as far as saying that in your case a k-D tree algorithm running on

a single CPU core will wipe the floor with brute force search on the fastest GPU you can buy.

One of my publications contains a comparison of CPU implementations of the respective algorithms:


When your model point set (reference point set) is static,

and your data does not permit more efficient approaches like normal shooting,

I think your best bet is to implement a (simple) k-D tree algorithm on the GPU.


Thanks I will make a test with a million point data set both reference and target and I will post here the time taken and code with brute force search, concluding in this way the subject.



22 secs for a million query points and reference points with 5000 chunck size constant memory. There is a need for something faster indeed. But 22 secs is not tragic and I think it beats the kD tree. My OS is the Mac OS X and it is on the GTX 480. Usually there is no need to exceed far from 1000000 points. Anyway I will test also the Randomized version of KNN all in good time. 1 million queries on 1 million data set in under 4 secs as mentioned and I am certain with my card it will be faster. This is quite acceptable. :-)

New code made that take ms but it could be wrong. Debugging and posting the final code. If the timings I get are valid then it would be worth the post if not just for the record.

It is just quick until 100000 less than a second. I used textures. I am just posting the code just for the record. Of course a more optimal version can exist. It is rather simplistic but does the job. (1.66 KB)

This is much faster than I had expected. The raw processing power of the GTX 480 is quite impressive.

The experiments in my paper suggest that the k-D tree version

takes approximately six seconds on an Athlon XP 1500+.

BTW, this result is for double precision data.

When both point sets have 1 million points, you have to do 1e12 distance computations.

Each distance computation requires at least 5 floating point operations,

plus one operation for the comparison, giving a total of 6e12 operations.

The GTX 480 is rated at 1350 GFlops, which is 1.35e12 operations per second.

Therefore, brute force search takes at least four seconds.

I did not read the literature on k-D tree algorithms on the GPU,

and the back-tracking stage of the algorithm might make

a straight-forward implementation difficult.

However, I am still convinced that your problem configuration

can vastly benefit from a more sophisticated search algorithm.

Best regards,


Yes indeed this is for constant memory (not included in the code). The GPU was fully utilized I could hardly do anything else in the OS X.

I will try also to implement a kd-tree based on the GPU. There is a paper that describe the methodology.

So regarding memory latencies I have reached close.

You are probably right. I will do some experiments.



I have a question how much time does it take to construct the RBC of 1000000 query and database 3D points and what are the nr, s that you should use? For instance if you take your formula from the paper nr=s=sqrt(n)log(n) how much time does the construction of the RBC take?

In my Mac OS X with a GTX 480, being certain that the GPU runs as it should and using your code with random 3D points, it takes a long time. The query though takes just 6secs. Is the long time valid?

I think it should be valid. You are probably not interested in the RBC but the query time. For instance if the Database Points remain fix you can have multiple 1000000 queries in a dynamic system that gives you multidimensional data and want to compare with the database as with the Robot arm from which you require the data points. If though the Database Points change then there is a great problem. In my case the Database Points also change in number and in coordinates. Thus I think that a GPU kd-tree is the only choise… anyway just to verify your 4secs. In my case it is 6secs but you made your tests in a Tesla card so naturally with a code not optimized for the Fermi architecture it takes longer.

Best Regards,


As a comparison, I built a RBC on 1M database points (21-dimensional), and ran 1M queries on it. The total time (including the time to build and the time to search) was under 12 seconds. That was on a Tesla c2050, but I’d be surprised if the performance is very different on a gtx 480 (though I don’t own one). Note that was the performance for 1-NN search ( using the queryRBC() function ); the performance for k-NN search ( using the kqueryRBC() function ) will be somewhat worse, though both methods use the same build function.

There are some suggested parameters in the paper. A good starting point is nr=s=1024; maybe also try {2048, 3072, 4096, 5120} if you want to play with it, but 1024 should work well for you.

How long was it exactly?

A kd-tree is very hard to beat in low-dimensional spaces, even on a CPU…

It takes more than 2 minutes. I think this is enough to be prohibitive if the Database changes.

Yeah, that doesn’t sound right to me. Let’s take this to email. Can you send me your driver so I can take a look?

Ok going into e-mail mode. :-)


After some discussion with Lawrence he pointed out that I should have used fewer representatives not exceeding 3000

So in order to make things right and not pass a wrong impression about his paper I am posting the results on my GTX-480 which are slightly better than c2050

On the GTX-480 with Mac OS X

for nr=s=2048

running rbc…

.. build time for rbc = 4.0294 

.. query time for rbc = 0.8915 

for nr=s=3072

running rbc…

.. build time for rbc = 5.9193 

.. query time for rbc = 1.3230 

It is slightly better than c2050

Around five seconds of query time is not bad at all, the error rate is questionable since in ICP there is error, maybe an efficient scheme for ICP could be found with Lawerence paper. Maybe.



Also an important note on ICP. There is a need for only one time to construct the RBC since only the query data points are getting translated and rotated. So an initialization step of 4secs and 0.9secs per iteration is quite acceptable. If the data points are very dense then you can do a uniform subsampling on the datapoints so as never to exceed 5 million points for example. One needs to think also the memory constraints on the GPU. Also one can add color, principal curvatures and increase the dimensionality of the problem.

Here is a video of the famous Scanalyze boosted with CUDA using RBC. Sorry it is in production stage no executable/code available. To be released on June in Mac Store for free. :-)

Observe how fast the iterations go.