I’ve been working on a CUDA implementation of a particle-in-cell algorithm, and there is one part that takes a long time but that I haven’t been able to move onto the GPU yet. Here is the problem:

There is a system of N particles on a grid G with grid spacing h. These particles interact, and I want to calculate their trajectories. The brute-force way of handling this is shown in the Nbody example. That is fine for small N, but for very large N (>100k) it is exceedingly slow. So instead of calculating N^2 interactions, you weight the particle charges onto the grid points, solve the Poisson equation on the grid for the potential, and interpolate the grid potentials back to the particles to get the forces. For large N this is indeed much faster. However, a lot of time is still spent in the one part I haven’t parallelized: weighting the particle charge onto the grid points.

The problem with parallelizing it is that if each particle is a thread, several threads may have to read and write the same grid locations at the same time. If each grid point is a thread, it seems you still have to loop through every particle to find which ones are close, and you save no time.

I thought I might do this with atomicAdd(&grid[index], weight), but the documentation says atomicAdd is only supported for integers, and the particle charge weighting needs to be a float. atomicExch might work, but you still need to read that grid location’s current charge to know how to update it, and that read-modify-write is exactly the problem.

Has anyone seen a solution to this problem, or have an idea of how to approach it?

I’ve done similar algorithms for Monte Carlo particle simulation. There is no single “correct” answer, since it all depends on your exact problem. In particular, if you’re dealing with only a few thousand particles, streaming through them is cheap and easy.

But my app needs 2-5 million particles, so some spatial data structure is necessary. What I ended up doing was building a binary bounding volume hierarchy: a tree where each node has two children, each with its own (potentially overlapping) 3D axis-aligned bounding box.

Each grid point GATHERS its nearby contributors by walking the tree with a local stack of nodes. (There are no threading issues when gathering!)
After the particle position update, I update the BVH with the new bounds. Each update iteration makes the tree less efficient as the bounds diffuse larger and larger, so you need occasional “tree refresh” steps that perform local node rotations and rebounding swaps to optimize the tree again.

If your particles move slowly (i.e., an RMS speed of something like 1/4 your grid resolution per timestep) this should work well. If your particles move faster, you may want to skip the incremental tree update and just build a new tree every timestep.

This strategy was influenced by a lot of work I’ve done in raytracing, which has similar problems of dynamic changes to a geometric database that you need to continuously query.

Another approach is to keep particles in per-voxel grid lists and let particles migrate from one list to another as they move. This works well for many apps, like molecular simulation, where the particle density is self-limiting, so you can use fixed-size voxel lists and not worry about overflow. HOOMD is one terrific example of the per-voxel particle storage and migration method. If you have no density limit, you can’t preallocate lists like this, so trees become the only solution. Trees work fine at low density too, but grids are much easier to program.

Thanks SP Worley, that’s just the kind of advice I needed! The tree idea sounds promising; I will look into it. I had forgotten that each grid point doesn’t need to query the entire particle position list, only the nearby cells to see if there are particles there. That’s great.