I’m implementing an SPH solver. The core of an SPH solver is nearest neighbor querying (NNQ).
The performance of NNQ is responsible for approx. 90% of the overall simulation time.
My question is: how do I optimize nearest neighbors querying?
I use the cell-linked list approach (which uses a virtual 3D grid (16x16x16) to discretize the particles into grid cells).
The algorithm to preprocess the particle’s position and velocity buffers (i.e. sort them according to the particle’s grid cell’s index):
- generate gridKey* for each particle, based on the particle's position
- sort both the position and velocity buffers according to the gridKey assigned with particular particle
<i>* gridKey is mapping from uint3 -> uint. I.e. from 3D grid cell coordinate to linear ID (key).</i>
Querying neighbors for one particle consists of the following steps:
- calculate gridKey for the particle
- determine the 26 neighboring grid cells, that surround the cell which contains our particle
- iterate over all particles from the neighboring 26 cells and also over the particles from the cell which contains our particle - hence 27 cells. Inside the iteration loop, our particle's position is compared with all the particles from 27 cells. The result of this comparison is added to an accumulation variable (sidenote: accumulation variable = density for our particle).
- After we've iterated over all neighboring particles, the accumulation variable is written to a global buffer.
My problem is organization of work into blocks and the number of threads in a block.
I’ve tried several ways to organize the work, but even the fastest of them is still slow for me. I’ve tried the
- 1 thread = 1 particle,
block-size = the maximal count of threads per block
grid-size = number-of-particles/block-size
- a thread loads its particle and determines its grid cell (uint3)
- use 3 nested for-loops to iterate over ranges [-1, +1] in each X,Y and Z, representing the offset from the current particle's grid cell.
- in the innermost for loop, apply the offset and if we're not out of grid bounds, proceed
- iterate over the range of the current (the offsetted) grid cell and let our particle interact with all the neighboring particles.
- after we iterate over all particles, write the result (the accumulation variable mentioned before) into a global buffer
</li><li> [this is possible only if the grid key is linear**] the same as a), but instead of 3 nested loops, use only 2 and exploit the fact that in X axis direction, the cells are stored immediately after each other </li><li> the same as 1), but process 2 particles per thread</li> [i]** For illustration of what linear space-filling curve is, take a look here (its the "a) row" http://ww4.sinaimg.cn/mw690/7178f37ejw1emyk4zds00j20e908igmv.jpg[/i]
- 1 block = 1 grid cell,
block-size = the maximal allowed number of particles in a grid-cell (based on the rest density),
grid-size = the number of grid cells in grid (for 16x16x16 grid, it's 4096)
Premise: all particles in a grid cell have to iterate over the same neighboring particles. Load all the particles at once into shared memory and then iterate over them.
- coalescedly load all particles' positions from the currently processed cell into register of each
thread in block (this is possible, because the particles' positions are sorted by their cell key).
With that being done, each thread in block has its corresponding particle's position
stored in register. (Only as many threads in block will read its particle positions,
as there're particles in the currently processed cell)
</li><li> similarly to step 0, all threads in a block will coalescedly load all particles from the 27 neighboring cells into shared memory, one cell at a time. </li><li> now that we have all the neighboring-particle-candidates in a shared memory, we'll just iterate over them. Since each thread represents 1 particle, we process all particles in a grid cell per one iteration over the shared memory. </li><li> after we iterated over all particles from shared memory, we'll coalescedly write (similar to step 0) the results of the interaction into a global buffer.</li>
</li><li> similar to 1), but 1 block processes 8 grid cells and block-size = 4*max-particles-per-grid-cell, we process 4 grid-cells at once.</li>
- coalescedly load all particles' positions from the currently processed cell into register of each thread in block (this is possible, because the particles' positions are sorted by their cell key). With that being done, each thread in block has its corresponding particle's position stored in register. (Only as many threads in block will read its particle positions, as there're particles in the currently processed cell)
- 1 thread = 1 particle, block-size = the maximal count of threads per block grid-size = number-of-particles/block-size
The fastest way of nearest neighbors query is (surprisingly enough for me) the 1.1. method (which is still too slow for me).
Given that the shared memory of a block is unused gives me impression that there’s still a hidden potential to optimize my search somehow. I’ve also thought about using the shared memory for the 1.1. case in the following way: the blocks’s threads load particles. The particles (i.e. their positions) are sorted, hence 1 block contains continuous sequences of particles from several grid cells; [| 5 | 6 |7|]
= block of threads
|x| = threads of the block containing particles from grid cell (or rather grid key) no. x
Given that all the particles in the range | 5 | load the same neighbors, I could load their neighbors to the shared memory coalescedly (like I do in the step 2.1.1., and then iterate over them, processing just the particles from | 5 |. The rest of threads would process their particles just like in 1.1… I haven’t tried this approach, because although shared memory would be used, there would be thread divergence (given that I’d need to discriminate between the first sequence of threads and the rest).
I guess that the way 1.1. is fastest, because it implicitly coalesces global memory reads; threads from one sequence*** execute the same code (branch in the same way etc.) and iterate over exactly the same neighbors, so when e.g. 10 threads request the same place in global memory, the HW reads the place just once.
(***sequence being the threads processing particles from identical grid cell - e.g. | 5 |)
Is this explanation correct?
TLDR: How would you further optimize nearest neighbors query?
Kudos to you if you made it through the text :).
Thank you for your time and opinions.