One of the key steps of ICP is finding, for each point of a reference point set in 3D, its closest neighbor in another point set. I am referring to brute-force search, and the point sets can consist of millions of points.
The first stage is to find the distances from the reference set R to the target set T. You keep an array of best distances DR for the reference set and a matching index array IR (not shown here, left as an exercise). Note that you do not need an array of distances for the target set, for the reason explained below: each reference point only tracks its running minimum.
You divide the target set into chunks CT placed in constant or texture memory, to take advantage of caching, since the reads of the target points cannot be coalesced.
Now, for each chunk and all the reference points, you proceed as follows; note first that the reads from the reference set will be coalesced:
In the host:

    initialize DR[i] = FLT_MAX for all i;
    for each chunk:
        copy the chunk into CT (constant or texture memory);
        launch the kernel;

Inside the kernel, one thread per reference point:

    i = blockIdx.x*blockDim.x + threadIdx.x;
    RP  = R[i];  //coalesced read
    min = DR[i]; //coalesced read
    for (j = 0; j < CHUNK_SIZE; j++){
        CTP  = CT[j];          //cached read from constant memory
        Dist = dist(CTP, RP);
        if (Dist < min)
            min = Dist;
    }
    DR[i] = min; //coalesced write; no __syncthreads() needed, since each thread writes only its own element
Any improvements you can suggest are welcome. I think this is a nice example for seeing the difference between the GF100 and the GT200. A good discussion would be how the Fermi architecture reacts to the different stages of the algorithm. Any optimization ideas for the Fermi architecture are very welcome. I have now dived into the world of Fermi, and I am a good learner. :-)
True, but this only works if you are doing 1-nearest neighbor search. If you are interested in finding the k-nearest neighbors (k>1), you can avoid storing the distance array for small values of k using a sort-merge technique. I talked about this a bit here:
I was not aware of this work, thanks for the pointer. I am going to run some tests with your code as well. I need a k-nearest-neighbor search not for this phase, but for the phase of estimating the normals of a 3D point set, where there is no structural information, just the points. Thanks! Also, I will write to you about my experience using it on Mac OS X with a GTX 480 card. I got this beast working with the Mac, and I am thrilled with its performance so far. :-)
Thanks, I will run a test with a million-point data set for both reference and target, and I will post the time taken and the brute-force search code here, concluding the subject in this way.
22 secs for a million query points and a million reference points, with a chunk size of 5000 in constant memory. There is indeed a need for something faster, but 22 secs is not tragic, and I think it beats a kd-tree. My OS is Mac OS X and the card is a GTX 480. Usually there is no need to go far beyond 1000000 points. Anyway, I will also test the randomized version of KNN, all in good time. 1 million queries on a 1-million-point data set in under 4 secs, as mentioned, and I am certain it will be faster on my card. This is quite acceptable. :-)
I have written new code that takes milliseconds, but it could be wrong. I am debugging it and will post the final code. If the timings I get are valid, it will be worth the post; if not, it is here just for the record.
It is quick only up to about 100000 points, taking less than a second. I used textures. I am posting the code just for the record; of course, a more optimal version can exist. It is rather simplistic, but it does the job. NN.zip (1.66 KB)
I have a question: how much time does it take to construct the RBC for 1000000 query and database 3D points, and what values of nr and s should one use? For instance, if you take the formula from your paper, nr = s = sqrt(n)·log(n), how much time does the construction of the RBC take?
On my Mac OS X machine with a GTX 480, being certain that the GPU runs as it should and using your code with random 3D points, the build takes a long time. The query, though, takes just 6 secs. Is the long build time expected?
I think it should be valid. You are probably not interested in the RBC build time but in the query time. For instance, if the database points remain fixed, you can run multiple 1000000-point queries in a dynamic system that produces multidimensional data to compare against the database, as with a robot arm from which you acquire the data points. If, though, the database points change, then there is a real problem. In my case, the database points change both in number and in coordinates, so I think a GPU kd-tree is the only choice… Anyway, just to verify your 4 secs: in my case it is 6 secs, but you ran your tests on a Tesla card, so naturally, with code not optimized for the Fermi architecture, it takes longer.
As a comparison, I built a RBC on 1M database points (21-dimensional), and ran 1M queries on it. The total time (including the time to build and the time to search) was under 12 seconds. That was on a Tesla c2050, but I’d be surprised if the performance is very different on a gtx 480 (though I don’t own one). Note that was the performance for 1-NN search ( using the queryRBC() function ); the performance for k-NN search ( using the kqueryRBC() function ) will be somewhat worse, though both methods use the same build function.
There are some suggested parameters in the paper. A good starting point is nr=s=1024; maybe also try {2048, 3072, 4096, 5120} if you want to play with it, but 1024 should work well for you.
How long was it exactly?
A kd-tree is very hard to beat in low-dimensional spaces, even on a CPU…
After some discussion with Lawrence, he pointed out that I should have used fewer representatives, not exceeding 3000.
So, in order to set things right and not give a wrong impression of his paper, I am posting the results on my GTX 480, which are slightly better than on the c2050.
On the GTX-480 with Mac OS X
for nr=s=2048
running rbc…
.. build time for rbc = 4.0294
.. query time for rbc = 0.8915
for nr=s=3072
running rbc…
.. build time for rbc = 5.9193
.. query time for rbc = 1.3230
It is slightly better than c2050
Around five seconds in total (build plus query) is not bad at all. The error rate is open to question, but since ICP itself involves error, maybe an efficient scheme for ICP could be built on Lawrence's paper. Maybe.
Also, an important note on ICP: the RBC needs to be constructed only once, since only the query data points get translated and rotated. So an initialization step of 4 secs plus 0.9 secs per iteration is quite acceptable. If the data points are very dense, you can uniformly subsample them so as never to exceed, for example, 5 million points. One also needs to consider the memory constraints of the GPU. One can also add color or principal curvatures and increase the dimensionality of the problem.
Here is a video of the famous Scanalyze, boosted with CUDA using the RBC. Sorry, it is at the production stage; no executable or code is available yet. To be released in June on the Mac App Store for free. :-)