Optimizing nearest-neighbors querying

Hi,

I’m implementing an SPH solver. The core of an SPH solver is nearest neighbor querying (NNQ).
The performance of NNQ is responsible for approx. 90% of the overall simulation time.

My question is: how do I optimize nearest neighbors querying?

I use the cell-linked-list approach, which discretizes the particles into cells of a virtual 16x16x16 grid.
The preprocessing of the position and velocity buffers (i.e. sorting them by each particle's grid-cell index) works as follows:

  1. generate a gridKey* for each particle, based on the particle's position
  2. sort both the position and velocity buffers according to the gridKey assigned to each particle
  * gridKey is a mapping from uint3 to uint, i.e. from a 3D grid-cell coordinate to a linear ID (key)
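
For concreteness, here is a minimal sketch of how this preprocessing could look, assuming a 16x16x16 grid, a simple row-major linearization, and Thrust for the sort; all names here (gridKeyOf, cellOf, computeKeys, sortByCell, origin, cellSize) are illustrative, not my actual code:

#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

// Illustrative only: row-major linearization of a cell coordinate in a 16x16x16 grid.
__host__ __device__ inline unsigned int gridKeyOf(uint3 c)
{
    return c.x + 16u * (c.y + 16u * c.z);
}

// Illustrative only: map a position to its cell, assuming the domain starts at
// `origin` and each cell has edge length `cellSize` (typically the smoothing radius h).
__host__ __device__ inline uint3 cellOf(float3 p, float3 origin, float cellSize)
{
    return make_uint3((unsigned int)((p.x - origin.x) / cellSize),
                      (unsigned int)((p.y - origin.y) / cellSize),
                      (unsigned int)((p.z - origin.z) / cellSize));
}

// Step 1: one key per particle.
__global__ void computeKeys(const float4* pos, unsigned int* keys,
                            float3 origin, float cellSize, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        keys[i] = gridKeyOf(cellOf(make_float3(pos[i].x, pos[i].y, pos[i].z),
                                   origin, cellSize));
}

// Step 2: sort positions and velocities together by key.
void sortByCell(thrust::device_vector<unsigned int>& keys,
                thrust::device_vector<float4>& pos,
                thrust::device_vector<float4>& vel)
{
    thrust::sort_by_key(keys.begin(), keys.end(),
                        thrust::make_zip_iterator(
                            thrust::make_tuple(pos.begin(), vel.begin())));
}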

Querying neighbors for one particle consists of the following steps:

  1. calculate gridKey for the particle
  2. determine the 26 neighboring grid cells that surround the cell containing our particle
  3. iterate over all particles from those 26 cells plus the particles from our particle's own cell, hence 27 cells in total. Inside the loop, our particle's position is compared with every particle from the 27 cells, and the result of each comparison is added to an accumulation variable (side note: the accumulation variable is our particle's density).
  4. after we've iterated over all neighboring particles, the accumulation variable is written to a global buffer
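
A minimal sketch of these four steps, one thread per particle (essentially variant 1.1. described further below). The names cellStart/cellEnd (per-cell particle ranges), the hard-coded 16^3 grid and the simplified poly6-style weight are illustrative assumptions, not my actual code:

// Minimal sketch, not the actual solver code. Assumes:
//  - pos[] is sorted by gridKey,
//  - cellStart[k] / cellEnd[k] give the first / one-past-last sorted particle
//    index of cell k (one common way to build them is sketched further below),
//  - the grid is 16x16x16 and the cell edge length equals the smoothing radius h.
__global__ void densityKernel(const float4* __restrict__ pos,
                              float*        __restrict__ density,
                              const unsigned int* __restrict__ cellStart,
                              const unsigned int* __restrict__ cellEnd,
                              float3 origin, float cellSize, float h2, float mass,
                              unsigned int numParticles)
{
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;

    const float4 pi = pos[i];
    // step 1: the cell containing our particle
    const int3 c = make_int3((int)((pi.x - origin.x) / cellSize),
                             (int)((pi.y - origin.y) / cellSize),
                             (int)((pi.z - origin.z) / cellSize));
    float rho = 0.0f;

    // steps 2 + 3: visit the 27 cells around (and including) c
    for (int dz = -1; dz <= 1; ++dz)
    for (int dy = -1; dy <= 1; ++dy)
    for (int dx = -1; dx <= 1; ++dx)
    {
        const int nx = c.x + dx, ny = c.y + dy, nz = c.z + dz;
        if (nx < 0 || ny < 0 || nz < 0 || nx >= 16 || ny >= 16 || nz >= 16)
            continue;                                   // out of grid bounds
        const unsigned int key = nx + 16u * (ny + 16u * nz);
        for (unsigned int j = cellStart[key]; j < cellEnd[key]; ++j)
        {
            const float4 pj  = pos[j];
            const float dxp = pi.x - pj.x, dyp = pi.y - pj.y, dzp = pi.z - pj.z;
            const float r2  = dxp * dxp + dyp * dyp + dzp * dzp;
            if (r2 < h2)
            {
                const float w = h2 - r2;  // poly6-style weight, constants omitted
                rho += mass * w * w * w;
            }
        }
    }
    // step 4: one write of the accumulated density
    density[i] = rho;
}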

My problem is how to organize the work into blocks and how many threads to use per block.
I’ve tried several ways to organize the work, but even the fastest of them is still too slow for me. I’ve tried the
following:

  1. Particle point of view
    1.1. 1 thread = 1 particle, block-size = the maximum number of threads per block, grid-size = number-of-particles / block-size
      1. a thread loads its particle and determines its grid cell (uint3)
      2. use 3 nested for-loops to iterate over the range [-1, +1] in each of X, Y and Z, representing the offset from the current particle's grid cell
      3. in the innermost loop, apply the offset and, if we're not out of grid bounds, proceed
      4. iterate over the particle range of the current (offset) grid cell and let our particle interact with all those neighboring particles (a sketch of how the per-cell particle ranges can be built follows this list)
      5. after iterating over all particles, write the result (the accumulation variable mentioned before) into a global buffer
    1.2. [possible only if the grid key is linear**] the same as 1.1., but instead of 3 nested loops use only 2, exploiting the fact that along the X axis the cells are stored immediately after each other
    1.3. the same as 1.1., but process 2 particles per thread

    ** For an illustration of what a linear space-filling curve is, see row "a)" here:
       http://ww4.sinaimg.cn/mw690/7178f37ejw1emyk4zds00j20e908igmv.jpg

  2. Grid point of view
    2.1. 1 block = 1 grid cell, block-size = the maximum allowed number of particles in a grid cell (based on the rest density), grid-size = the number of grid cells (4096 for a 16x16x16 grid)
         Premise: all particles in a grid cell have to iterate over the same neighboring particles, so load all of those particles into shared memory once and then iterate over them.
      0. coalescedly load the positions of all particles in the currently processed cell into a register of each thread in the block (this is possible because the positions are sorted by their cell key); each thread then holds its particle's position in a register (only as many threads read a position as there are particles in the currently processed cell)
      1. similarly to step 0, all threads in the block coalescedly load the particles from the 27 neighboring cells into shared memory, one cell at a time
      2. now that all neighbor candidates are in shared memory, just iterate over them; since each thread represents 1 particle, all particles of the grid cell are processed in one pass over shared memory
      3. after iterating over all particles in shared memory, coalescedly write (similar to step 0) the interaction results into a global buffer
    2.2. similar to 2.1., but 1 block processes 8 grid cells with block-size = 4 * max-particles-per-grid-cell, processing 4 grid cells at once
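
For reference, the per-cell ranges (cellStart/cellEnd) assumed in the sketches above can be built right after the sort with a simple boundary-detection pass, similar in spirit to Simon Green's CUDA particles sample; again, the names and details are illustrative, not my code:

// Sketch of one way to build per-cell particle ranges after sorting by key.
// sortedKeys[i] is the gridKey of sorted particle i. Both output arrays have one
// entry per grid cell (4096 for a 16x16x16 grid); cellStart is pre-filled with
// 0xFFFFFFFF and cellEnd with 0, so empty cells yield an empty range.
__global__ void findCellRanges(const unsigned int* __restrict__ sortedKeys,
                               unsigned int* __restrict__ cellStart,
                               unsigned int* __restrict__ cellEnd,
                               unsigned int numParticles)
{
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;

    const unsigned int key = sortedKeys[i];
    // a cell starts where the key differs from the previous particle's key
    if (i == 0 || key != sortedKeys[i - 1])
        cellStart[key] = i;
    // a cell ends where the key differs from the next particle's key
    if (i == numParticles - 1 || key != sortedKeys[i + 1])
        cellEnd[key] = i + 1;
}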
      

The fastest nearest-neighbors query is (surprisingly enough for me) method 1.1. (which is still too slow for me).

The fact that the blocks' shared memory goes unused gives me the impression that there is still hidden potential to optimize my search somehow. I've also thought about using shared memory for the 1.1. case in the following way: the block's threads load their particles. The particles (i.e. their positions) are sorted, hence 1 block contains contiguous sequences of particles from several grid cells: [| 5 | 6 | 7 |]
Legend:
          [ ... ] = block of threads
          | x |   = threads of the block containing particles from grid cell (or rather grid key) no. x

Given that all the particles in the range | 5 | load the same neighbors, I could load their neighbors into shared memory coalescedly (like I do in step 2.1.1.) and then iterate over them, processing just the particles from | 5 |. The rest of the threads would process their particles just like in 1.1. I haven't tried this approach because, although shared memory would be used, there would be thread divergence (given that I'd need to discriminate between the first sequence of threads and the rest).

I guess that method 1.1. is fastest because it implicitly coalesces global memory reads: threads from one sequence*** execute the same code (branch the same way, etc.) and iterate over exactly the same neighbors, so when e.g. 10 threads request the same location in global memory, the hardware reads that location just once.

(*** a sequence being the threads that process particles from an identical grid cell - e.g. | 5 |)

Is this explanation correct?

TL;DR: How would you further optimize the nearest-neighbors query?

Kudos to you if you made it through the text :).
Thank you for your time and opinions.

This is outside my area of expertise; I had to look up SPH. Generally speaking, assigning one thread per result datum (matrix element, particle, etc.) is often a good organization for CUDA implementations.

What recent literature have you consulted? There seems to be quite a bit out there. For example, section 4 of the following presentation highlights some optimization techniques for SPH solvers on GPUs:

[url]http://www.scd.stfc.ac.uk/SCD/resources/pdf/sph/Dominguez_DualSPHysicsOpenMPMulti-GPUSPHsolver.pdf[/url]

I’ve read almost everything that exists :D (including the PDF you posted).

I started by reading Position Based Fluids, which didn’t discuss the implementation of NNS, but rather pointed me to Simon Green’s Particle Simulation using CUDA.

Then I read Rama Hoetzlein’s Fast Fixed-Radius Nearest Neighbors. After that I read Prashant Goswami’s Interactive SPH Simulation and Rendering on the GPU, and after a while I noticed that Rama Hoetzlein’s fluids3 website states that the fluids3 implementation already tried Goswami’s approach (using shared memory) but the code was 4x slower, which I can second, as I tried to implement it myself.

Of course, I’ve read much more of the literature; the sources above are the most influential ones.

Sidenote: after reading dozens of papers on SPH, I noticed that almost no paper goes into the implementation of the neighbor query. Most papers just state “find neighboring particles”.

Would this code be of any use in your work, or is that a totally different kind of nearest neighbor search:

[url]http://vincentfpgarcia.github.io/kNN-CUDA/[/url]

Thank you for your determination to help me!

Several hours ago I actually downloaded the source of kNN CUDA (what a coincidence), and right now I’m reading through it, so I’ll see if it helps. If I achieve a significant speedup, I’ll let you know.

After reading the kNN CUDA NNS paper, it sounds like a ridiculously slow algorithm for my application – even slower than the naive O(N^2) approach where you iterate over all particles for every particle. kNN CUDA computes the distances between a particle and all other particles, stores the distances in a buffer, sorts the buffer, and selects the first k particles. It seems like it was designed to accelerate CPU code.
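
Just to make the contrast explicit, the brute-force scheme described above boils down to something like the following for a single query particle (a stripped-down sketch of the general idea, not the actual kNN-CUDA code): a full pass over all N particles plus a sort, repeated for every query.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>

// Stripped-down sketch of the brute-force kNN idea (NOT the actual kNN-CUDA code):
// for ONE query position q, compute the squared distance to every particle.
__global__ void allDistances(const float4* __restrict__ pos, float4 q,
                             float* __restrict__ dist, unsigned int n)
{
    const unsigned int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n) return;
    const float dx = q.x - pos[j].x, dy = q.y - pos[j].y, dz = q.z - pos[j].z;
    dist[j] = dx * dx + dy * dy + dz * dz;
}

// Host side: sort particle indices by distance; the first k entries of idx are
// the k nearest neighbors of q. Repeating this for all N particles is what makes
// the approach so much more expensive than a grid-based search.
void knnForOneQuery(thrust::device_vector<float>& dist,
                    thrust::device_vector<unsigned int>& idx)
{
    thrust::sequence(idx.begin(), idx.end());                    // 0, 1, 2, ...
    thrust::sort_by_key(dist.begin(), dist.end(), idx.begin());  // nearest first
}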

You might look at open-source HOOMD-Blue which uses its own binned grid method for gathering points for local interaction. There are paper(s) on the website about the algorithm used, and the author, Mr. Anderson, used to post to this forum quite regularly.

Thank you for pointing me to this project! I guess I’ll try to implement the Stenciled cell list, which means subdividing my original grid into a more granular grid and then for each particle just selecting those grid cells that are closest to a particle.

Plug: we have a similar algorithm [1], also for molecular simulation, though this one is more general and targets pair-interaction calculation on pretty much any SIMD and GPU architecture, as it allows neat tuning for SIMD width, data reuse, etc. (it runs on ~10 SIMD CPU and 2 GPU architectures). It generates a staggered grid of non-uniformly shaped cells with uniform particle count, in a hierarchical fashion. The algorithm is implemented and SIMD-optimized only on CPUs, not on GPUs (for portability reasons, and because we can use algorithmic tricks to decrease the frequency of gridding + search).

This algorithm has optimal work efficiency, works very well in the density regimes we are interested in, and does not suffer from the non-uniformity (and hence divergence) issues you end up with when you do spatially uniform gridding. (But in its current form, non-uniform particle density is not a target.)

[1] A flexible algorithm for calculating pair interactions on SIMD architectures - ScienceDirect

This is really interesting! From a skim of the paper, it looks like the approach may be useful for many related kinds of algorithms, like SPH or even explicit chemical kinetic reaction simulations.

Is your neighbor-finding code in the core GROMACS source? I’m curious to skim it, especially to see the use of SSE intrinsics.

Unfortunately, the paper pointed to is “closed source” (i.e., paywalled). I will see whether I can track down a free preprint, if there is one. I am perusing the GROMACS source right now, but haven’t been able to locate the functionality outlined above yet.

Here’s the free preprint: [1306.1737] A flexible algorithm for calculating pair interactions on SIMD architectures
And here’s an improved version of the SIMD calculation diagram (Fig. 4): GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers - ScienceDirect

The source is in GROMACS, but it’s, I guess, somewhat hidden due to layers of complexity and legacy code. Look at the “nbnxn” modules in https://github.com/gromacs/gromacs/tree/master/src/gromacs/mdlib:

  • nbnxn_grid, nbnxn_search: gridding/pair list building code
  • nbnxn_kernels/ contains the CPU kernels (that use a generic SIMD layer)
  • nbnxn_cuda/ nbnxn_opencl/ are the GPU modules with kernels + glue code

Thanks for the pointer to the pre-print. BTW, I quite like how the GROMACS folks have abstracted various sets of proprietary SIMD intrinsics into one set of generic SIMD intrinsics. Given the continued proliferation of new SIMD ISAs, that is probably never-ending work …

Do let me/us know how/if you managed to adapt the algorithm. The idea of particle-cluster as a unit of computation should transfer to many use-cases. Its usefulness will of course depend on the parallel work-efficiency (when a spherical interaction range is needed) vs divergence tradeoff with particle-based neighbor lists.

As noted above and commented on by njuffa, the SIMD operations are expressed with a SIMD abstraction layer, so it’s not SSE, but through our SIMD layer implementation it supports ten or so SIMD flavors.

@njuffa: it is never-ending work, but what’s the alternative? It’s also a significantly less daunting task than re-implementing all compute kernels for each CPU architecture (and in our case the tradeoffs in generality vs specialization are minor).

As I stated, I like this (very thin) SIMD abstraction layer used by GROMACS. As long as explicit SIMD is the one major way for CPU architects to provide increased compute throughput, I do not see a realistic alternative to this approach. Use of this abstraction layer certainly beats programming with a multitude of proprietary SIMD intrinsics in terms of overall productivity.

Has the GROMACS team considered “marketing” the abstraction layer itself to other projects? If multiple projects were to use it, that would certainly amplify the overall benefits.

After spending many years using CUDA’s implicit SIMD, I find explicit SIMD to be a pain in the behind, generally speaking, not something I would want to expose myself to voluntarily on a regular basis. I believe this is an evolutionary dead end in ISA design, something I already suspected when going from 2-way 3DNow! to 4-way SSE SIMD, way back when. To this day, even Intel’s compilers struggle to auto-vectorize code for their SIMD extensions, and very few people volunteer to optimize this way by hand, so lots of applications let a lot of compute power fall by the wayside.

We need to make a clear distinction between an ISA per se and its exposure in programming tools. Much to my surprise, I realized that the GCN ISA and AVX-512 have (almost?) no differences. Let’s consider this example:

if (a < b)
  a = b;
else
  b = a;

CUDA compiles such small conditionally-executed blocks into predicated operations. GCN has no predicated operations; instead this compiles to plain straightforward code:

[V] S1 = (a < b)      ; Compare two vectors and save results into 64-bit scalar register
    S2 = EXEC         ; EXEC is a hardware register masking execution of any vector operation
    EXEC &= S1
    if (EXEC==0)  goto else_part
[V] a = b
else_part:
    EXEC = S2 & ~S1
    if (EXEC==0)  goto following_code
[V] b = a
following_code:
    EXEC = S2

Two interesting things here. First, while it performs more operations than GeForce, only 3 of these operations (marked with [V]) occupy the vector engine - the others are performed on the int/branch engines, so we can count them (in most circumstances) as free. That’s probably the reason why GCN doesn’t have predicated operations - the vector engine simply doesn’t have access to the scalar registers storing predicates, and OTOH these extra operations cost nothing.

Second, this code can be translated to AVX-512 as easily as to the GCN ISA. The scalar registers storing predicates should be replaced with mask registers, and all vector operations should be predicated with a mask register emulating EXEC. That’s all. Moreover, if operations on mask registers can be executed on ports 5/6 (usually free in SIMD code), then the mask machinery will be as cheap as in the GCN ISA.
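
To make the translation concrete, here is a sketch of the if/else above written with AVX-512 intrinsics (illustrative only; each __m512 holds 16 lanes and the __mmask16 plays the role of EXEC):

#include <immintrin.h>

// Sketch: the if/else above expressed lane-wise with AVX-512 mask registers.
static inline void select_lanes(__m512& a, __m512& b)
{
    const __mmask16 lt = _mm512_cmp_ps_mask(a, b, _CMP_LT_OQ); // S1 = (a < b), per lane
    const __m512 old_a = a;
    a = _mm512_mask_mov_ps(a, lt, b);                   // a = b  where a <  b
    b = _mm512_mask_mov_ps(b, _mm512_knot(lt), old_a);  // b = a  where a >= b
}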

There are a few caveats - in particular, I don’t remember whether AVX-512 can easily execute “if (EXEC==0) goto”. Also, I’ve heard that GPUs can optimize the EXEC state stack by executing the branch with fewer threads enabled first. But all of that is still doable in AVX-512 with a fixed number of operations.

This means that OpenCL can be compiled to efficient AVX-512 code, and that was probably Intel’s goal from the start.

Not sure whether Intel’s recent SIMD hardware designs have been informed by the needs of OpenCL. Intel has their own SPMD compiler ([url]https://ispc.github.io/[/url]), which is a project originally started by a former NVIDIA engineer, if I recall correctly.

Ack, thanks.

Yes, but time has not allowed it so far, but we should reconsider again.

Indeed the gap between the achieved and achievable CPU performance is increasing and even Intel admits that manual vectorization is necessary, especially in HPC. However, while explicit SIMD is ugly, I’m not sure it is more challenging to write than good CUDA (and OpenCL) code*. CUDA is more clean and elegant, but that alone does not directly help in achieving good performance.

*addendum: one significant thing that makes writing fast CUDA code easier is that different addresses are allowed across lanes.

I happened to notice that the Google Skia project has its own SIMD abstraction layer. From skimming the design and code, it looks useful and clean. I think SIMD abstraction libraries tend to be invented and reinvented often, so there are probably dozens of them out there in both closed and open source. I made my own small subset back in the day to perform raytracing with both SSE and PPC AltiVec.

Interesting, thanks for sharing that.

I guess going from a SIMD abstraction design tailored for a single project’s use to a general-purpose one is an effort large enough to prevent such libraries from being released (and feature-completed).