Range search in cuda?

I’ve got a global set of points and I need each cuda thread to have the ability to do range-searches on that set (ie, get a list of all the points within specified radius). I want to use a uniform grid to store a list of all the points in any grid cell, which allows this range search to be computed in near O(1) average time for uniformly distributed set of points.

My question is, how should I go about representing this structure in a way that’s efficient for cuda?

edit - should I simply take all the points and put them into a contiguous list, then create an array of pointers into offsets into the global list, so that I can send all this as 1 block of memory? is that normally the technique that is used, or would that be considered a hackish way?

We discuss just such a data structure in General purpose molecular dynamics simulations fully implemented on graphics processing units
See: http://www.ameslab.gov/hoomd/about.html#paper for a full citation and preprint download.

Regarding your edit, that is not a hackish way at all. While my work has consistently shown that a fully 3D grid structure (actually 4D to store N points per 3D grid cell) works the most efficiently, many other applications and studies have gone the route of sorting the points by grid index and then offsetting into that array. I have made benchmarks with such a data structure, and it is 2-3 times slower than the grid data structure I am using, at least for the conditions I run at. Your mileage may vary.

The paper mentions that we didn’t solve the problem of actually putting the particles in the grid cells on the GPU. I have since solved that and would be happy to discuss the details with you. It is only 4-5x faster than a corresponding CPU implementation, including the host<->device memory transfers, however…

Hi Mr. Anderson, thanks for your reply!

First let me clarify that I’m only interested in doing 2D range searches at this point, not 3D.

I’m a little confused by your notation. You say that neighbor list entries are stored in NBL_{ji} \leftarrow k with the list of neighbors for each particle i going down a column of the matrix. Is NBL a 2D matrix where j is the index into the list of neighbors for particle i? If so then what is k? If that is what you’re saying, then an obvious disadvantage is that the number of rows in the matrix must be the maximum number of neighbors for any point, which means that there will be many blank entries in the matrix. I’m also curious why you would store the size of each list in a separate matrix, rather than simply using a null-terminating symbol at the end. The other disadvantages to doing it this way are that the lists themselves are redundant because the same point may appear in multiple neighbor lists…and construction of the neighbor lists will require doing all the range searches sequentially in the CPU first, which reduces parallelization potential, and additionally there’s more data to send to the GPU. It seems that if one were to prefer pre-computing the range searches in the CPU first, then you could at least reduce memory bandwidth by storing the data like this:


point all_lists[…]; //concatenation of all the neighbor lists. note: this causes redundancy.

point *neighbor_lists[num_particles]; //indexes into all_lists

int list_size[num_particles]; //length of the list


I have no idea what the alternative structure you mentioned that performs 2-3 times slower than this is. What is it?

I would prefer to do the range searches on the GPU…to increase parallelization potential. The method I proposed in my edit would facilitate this. I expect that my performance overall will scale linearly. I first coded this in a single thread, and then threaded it to run in 8 threads on my corei7, and got roughly 8x speedup. I expect that when I run this on a GPU with N cores, I will get Nx speedup. I assume that when you say “it is only 4-5x faster” that you are only referring to the access of neighbor lists by doing the range searches in the GPU instead of the CPU? That factor should go up with larger numbers of particles. In my case I don’t have that many particles, I expect that the total number will be less than 500.

Is the solution that you came up with for doing range searches on the GPU much different from what I proposed in my edit?

The same principles apply, 2D just makes life a lot easier.

Yes. k is the index of the neighbor.

But the not so obvious advantage is that setting it up that way allows for coalesced reads trivially. Non-coalesced reads = a factor of 20 less performance so they need to be avoided. See, in my application, the performance of reading the neighbor list is a lot more important than the cost to build it. GPUs have tons of memory anyway, so the wasted space is almost a non issue.

It is simpler to code the algorithms that read from the neighbor list this way. It also simplifies the divergent warp pattern of the non-uniform length loops.

I don’t generate the neighbor list on the CPU. I do it on the GPU: it is all outlined in the paper. And the “redundant” elements in the list are essential. I’m building these neighbor lists to do molecular dynamics where forces need to be summed over all particles within a given cutoff range. Listing the redundant elements allows the force to be computed in parallel for each force without any dependence on values computed by other threads and actually reduces the total memory bandwidth requirements.

??? You mentioned it explicitly in your post. I was talking about the idea of sorting the particles based on a grid index and then generating offsets into that array.

I don’t know why you are asking so much about the neighbor list. You asked specifically about the grid data structure for storing particles in grid cells. That data structure is discussed in the paper as a stepping stone to get to the neighbor list, and the neighbor list itself is something specific to molecular dynamics and not at all what you want (as you pointed out). When I mentioned a 4-5x speedup, I was just referring to the placing of particles into the grid data structure.

In any case: 500 is probably way to small a system to run on the GPU. GPUs don’t hit peak efficiency until you have 10,000+ threads. And GPUs are also built very different from CPUs so whatever you learned by scaling to an 8-core CPU system likely does not apply. I would highly suggest you read the CUDA programming guide to learn about the architecture.

Forgive me for being so dense, but I’m not seeing a clear difference between pre-computing the neighbor lists using your GPU algorithm compared to just computing the neighbors on-the-fly on the GPU. I see that you can achieve coalesced reads after you’ve precomputed the data, but you didn’t have coalesced reads when you precomputed it either, so that seems it would just have the same overall net effect.

This sounds rather condescending…I know that GPU’s have a different architecture. We have different problems. The peak efficiency for a problem is going to depend on the amount of parallel vs. sequential code. It may be that your problem requires a large number of particles for the overhead costs of data transfer and problem setup to become insignificant, but the amount of work that needs to be done for each particle is so much larger in my problem that a few seconds of overhead are irrelevant.

My apologies for attempting to answer your questions.

And I thanked you for sharing your wisdom, although when I [humbly] asked for clarification you stopped attempting to help me and instead got defensive and insulting. I still have questions unanswered and would welcome a return to the actual discussion at any time…

I was thinking of creating a thread with acknowledgements for some people over here.
There’s something to be learned from most of the posts, but I found some of them to be always quite to the middle of the problem.
Some people are just like a reference. For ex., if Mr. Anderson says something, I take it for true. Sometimes it might be a mistake, but most of the time it’s not.

Also, I would mention wumpus, who was not so active lately, but created decuda, tmurray, cbuchner1, sarnath, and some others. And I’ve been on CUDA forums for some 2 years now. If he recommended reading the manual, it is because it is the most important thing, and he, and others, always do.
Kind regards,

Here is another recent thread discussing the algorithms used to place particles into grid cells: http://forums.nvidia.com/index.php?showtopic=99924

I don’t know how I can answer the original question any more fully.

The original question was:

To reiterate in full detail what has already been said.

  1. I linked my publication which includes a section that outlines a cell data structure that is efficiently laid out for CUDA. Namely, to explain again what is mentioned in the paper: the cell data structure is laid out so that each cell is contiguous in memory so that each block in a grid can read a full cell in a coalesced manner. In that way, blocks can read in an process entire cells in parallel utilizing shared memory and many blocks can be run in a grid to process all cells in parallel.

If you prefer, the cell data structure outlined in that paper could store float4’s containing the particle coordinates instead of indicies. In some applications, that could result in faster performance.

  1. I stated that an alternate solution is store particles in a contiguous list, sort that list by grid cell index and store offsets into that list, as was mentioned in the edit to the original post. It is an effective data structure used by many. However, for my own applications, I have found it to be slower than a structured cell list.

Thanks for summarizing that again. The question I was trying to ask before, was why do you get better efficiency using your special memory access pattern to get coalesced reads if you still have to do the additional step of binning which is non-coalesced? I think that perhaps the answer is on page 8, where you note that you only do binning every 10 time steps, and you assume only neighboring bins need to be checked…reducing it to only 8% of the total computation. I was thinking before that you would do binning after every timestep, which would probably cancel out of the benefit. Anyway, this makes sense and thanks for the advice!

Oh, sorry for misunderstanding.

Yes, the binning/neighbor list is only done once every 10-15 steps (only as needed). The binning is actually super cheap: binning 64k particles only takes 1-2 milliseconds or so on the CPU. It is the actual range search using the grid cells that takes a large amount of time. That and using the generated neighborlist to calculate forces.

And if I could come up with a way to implement binning efficiently with coalesced writes, I would most certainly do so. The problem is that most of the ideas I’ve come up with turn the binning from an O(N) problem to an O(N^2) one which then overwhelms the savings from the now coalesced writes.

What if you repeated the re-binning process using several passes? ie, in the first pass, copy all the points that did not change bins-- all those writes would be coalesced. Then in the second pass, write only the points that shifted one bin to the right, and in the third pass, those that shifted one bin to the left, etc. Is there a reason why a strategy like that wouldn’t work?

I have a question about the CPU Algorithm 1 in your paper. You calculate C = force of Rk acting on Ri, then adjust Fi += C and Fk -= C. However since all particles will considered in the outer loop(as i), it looks like this algorithm would result in giving double the force to every particle. To correct for this, it looks like the “j” index should start iterating at “i”, like this:

for(int i=0; i<N; ++i){

for(int j=i; j<NN(i); ++j){



Am I missing something?

I’ve thought about various types of “update” methods, but haven’t had a chance to try them. In theory, they should work great as you have fully coalesced reads and writes and without the O(N^2) problem. In practice, at least in the one form I did code up, the time spent reading in all 27 neighboring bins to check for particles that have moved still takes a significant amount of time and it was still faster to re-bin the particles from scratch.

I’m not saying that such a method won’t work. Just that I myself haven’t gotten one to work faster than other code I have running that clears the bins and puts the particles in (even on the CPU).

The standard trick is to only store neighbors in the neighbor list where j < i. I guess I forgot to mention that explicitly in the paper.