MD - Neighbour List Update: How to re-construct (update) Neighbour Lists on the GPU

Yes, indeed. Check out CellListGPU.cu in the current hoomd trunk for a very fast implementation of cell list generation on the GPU. On G80 cards, going back to the CPU for the cell list was minimal overhead (a few %), but on Fermi the rest of the code got so much faster that the CPU cell list build was approaching 15% of the run time - so a GPU solution was needed.

On Fermi cards, with atomic ops serviced in the L2 cache, it runs very fast. I'm in the process of preparing a follow-up publication on these new algorithms, but for now they are already released and available (and you can always ask me questions about them). The new neighbor list code runs only one particle per thread and relies on the cache hierarchy to read the data in efficiently - the bottleneck in that code is actually the write operation!
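To make the binning step concrete, here is a minimal sketch of the one-particle-per-thread atomic approach. All kernel and variable names are made up for illustration - this is not the actual CellListGPU.cu code, just the general pattern of reserving a slot in each cell with an atomic counter:

__global__ void bin_particles(const float4* __restrict__ d_pos,   // particle positions (xyz)
                              unsigned int* d_cell_size,           // number of particles in each cell
                              unsigned int* d_cell_list,           // [cell][slot] -> particle index
                              unsigned int cell_list_pitch,        // max particles per cell
                              float3 box_lo, float3 cell_width,
                              uint3 cell_dim, unsigned int N)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N)
        return;

    float4 p = d_pos[idx];

    // compute the 3D cell coordinates of this particle (assumes p is inside the box)
    unsigned int ix = (unsigned int)((p.x - box_lo.x) / cell_width.x);
    unsigned int iy = (unsigned int)((p.y - box_lo.y) / cell_width.y);
    unsigned int iz = (unsigned int)((p.z - box_lo.z) / cell_width.z);
    unsigned int cell = (iz * cell_dim.y + iy) * cell_dim.x + ix;

    // atomically reserve a slot in the cell - on Fermi this atomic is serviced in L2
    unsigned int slot = atomicAdd(&d_cell_size[cell], 1u);

    // scattered, largely uncoalesced write into the cell list
    if (slot < cell_list_pitch)
        d_cell_list[cell * cell_list_pitch + slot] = idx;
    // (a real implementation would flag the overflow and rebuild with a larger pitch)
}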

As with many O(N) vs O(N^2) algorithm comparisons, there will likely be a break-even point below which the O(N^2) algorithm is faster for small N. I haven't benchmarked this extensively since the G80 days, but since you ask about a specific system size (1024 particles), I ran a quick benchmark in hoomd of an LJ liquid on a GTX 580.

O(N) nlist

Average TPS: 4434.61

Simulation:    11.3s | 100.0%
        Integrate:     3.3s | 29.1%
                NVT step 1:     0.6s | 5.5%
                        PDATA C2G:     0.0s | 0.2% 419.04 MiB/s
                        Self:          0.6s | 5.3%
                NVT step 2:     0.7s | 5.8%
                Net force:      0.6s | 5.4%
                Thermo:         0.4s | 3.5%
                Self:           1.0s | 9.0%
        Neighbor:      3.6s | 32.1%
                Cell:           0.2s | 1.4%
                        compute:     0.1s | 0.9%
                        init:        0.0s | 0.0%
                        Self:        0.1s | 0.5%
                compute:        2.8s | 24.8%
                dist-check:     0.6s | 5.1%
                Self:           0.1s | 0.9%
        Pair lj:       4.2s | 37.4%
        SFCPack:       0.1s | 0.5%
                PDATA G2C:     0.0s | 0.2% 442.03 MiB/s
                Self:          0.0s | 0.3%
        Self:          0.1s | 0.8%

O(N^2) nlist

Average TPS: 4590.57

Simulation:    10.9s | 100.0%
        Integrate:     3.2s | 29.8%
                NVT step 1:     0.6s | 5.6%
                        PDATA C2G:     0.0s | 0.2% 421.10 MiB/s
                        Self:          0.6s | 5.4%
                NVT step 2:     0.7s | 6.0%
                Net force:      0.6s | 5.4%
                Thermo:         0.4s | 3.5%
                Self:           1.0s | 9.3%
        Neighbor:      3.4s | 31.2%
                compute:        2.7s | 25.1%
                dist-check:     0.6s | 5.2%
                Self:           0.1s | 0.9%
        Pair lj:       4.1s | 37.7%
        SFCPack:       0.1s | 0.5%
                PDATA G2C:     0.0s | 0.2% 443.22 MiB/s
                Self:          0.0s | 0.3%
        Self:          0.1s | 0.8%

Looks like 1024 is near the break-even point here - the two nlist builds take a similar amount of time. You can find this O(N^2) nlist implementation in the hoomd code as well; it uses the standard shared-memory sliding-window technique to reduce the global memory bandwidth demand and keep everything coalesced (sketched below).
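To illustrate what I mean by the sliding window, here is a rough sketch of the pattern (illustrative names only, not the actual hoomd kernel; periodic boundary wrapping is omitted):

__global__ void nlist_nsq(const float4* __restrict__ d_pos,
                          unsigned int* d_n_neigh,     // neighbor count per particle
                          unsigned int* d_nlist,       // [particle][slot] -> neighbor index
                          unsigned int nlist_pitch,    // max neighbors per particle
                          float rmax_sq,               // (r_cut + skin)^2
                          unsigned int N)
{
    extern __shared__ float4 sh_pos[];   // one blockDim.x-wide tile of positions

    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float4 my_pos = (idx < N) ? d_pos[idx] : make_float4(0.0f, 0.0f, 0.0f, 0.0f);
    unsigned int n_neigh = 0;

    // slide a blockDim.x-wide window over all N particles
    for (unsigned int start = 0; start < N; start += blockDim.x)
    {
        // the whole block cooperatively loads one tile, fully coalesced
        unsigned int load = start + threadIdx.x;
        sh_pos[threadIdx.x] = (load < N) ? d_pos[load] : make_float4(0.0f, 0.0f, 0.0f, 0.0f);
        __syncthreads();

        if (idx < N)
        {
            unsigned int tile_n = min(blockDim.x, N - start);
            for (unsigned int j = 0; j < tile_n; j++)
            {
                unsigned int cur = start + j;
                float dx = my_pos.x - sh_pos[j].x;
                float dy = my_pos.y - sh_pos[j].y;
                float dz = my_pos.z - sh_pos[j].z;
                float rsq = dx * dx + dy * dy + dz * dz;
                if (rsq < rmax_sq && cur != idx && n_neigh < nlist_pitch)
                    d_nlist[idx * nlist_pitch + n_neigh++] = cur;
            }
        }
        __syncthreads();
    }

    if (idx < N)
        d_n_neigh[idx] = n_neigh;
}

Launch it with blockDim.x * sizeof(float4) bytes of dynamic shared memory; each position is then read from global memory once per block rather than once per particle.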

The real trouble with running only 1024 particles (with 1024 threads) is that this is nowhere close to fully occupying the GPU - it runs on an MP or two at most. Thus, you can double (or in this case, quadruple!) the system size and most of the kernels take the same amount of time.

Binned nlist for 4096 particles:

Average TPS: 3772.61

Simulation:    13.3s | 100.0%
        Integrate:     3.5s | 26.4%
                NVT step 1:     0.7s | 5.2%
                        PDATA C2G:     0.0s | 0.3%   1.05 GiB/s
                        Self:          0.6s | 4.9%
                NVT step 2:     0.7s | 5.3%
                Net force:      0.6s | 4.8%
                Thermo:         0.4s | 2.9%
                Self:           1.1s | 8.3%
        Neighbor:      4.9s | 37.3%
                Cell:           0.2s | 1.4%
                        compute:     0.1s | 0.9%
                        init:        0.0s | 0.0%
                        Self:        0.1s | 0.5%
                compute:        4.0s | 30.5%
                dist-check:     0.6s | 4.6%
                Self:           0.1s | 0.9%
        Pair lj:       4.5s | 34.1%
        SFCPack:       0.2s | 1.5%
                PDATA G2C:     0.0s | 0.3% 805.03 MiB/s
                Self:          0.2s | 1.2%
        Self:          0.1s | 0.8%

If you are truly interested in running only tiny systems at maximum speed, I'd suggest looking into parallelization strategies other than one particle per thread. RUMD advertises that it is optimized for small systems, and OpenMM claims the same, but I've never looked into the details myself. If I were to go down this road, I'd try running one thread per particle pair for the energy/force evaluation and forget about the neighbor list entirely (see the sketch below). At that point, the worry is that the integration steps become extremely high overhead (as seen in the profiles above).
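For what it's worth, a back-of-the-envelope sketch of the thread-per-pair idea might look like the following. This is purely hypothetical - not code from hoomd, RUMD, or OpenMM - and it skips periodic wrapping and energy/virial accumulation:

__global__ void lj_force_per_pair(const float4* __restrict__ d_pos,
                                  float4* d_force,        // per-particle force accumulators (zeroed beforehand)
                                  float lj1, float lj2,   // 4*eps*sigma^12 and 4*eps*sigma^6
                                  float rcut_sq, unsigned int N)
{
    // one thread per unique pair (i, j) with j < i
    unsigned long long pair = blockIdx.x * (unsigned long long)blockDim.x + threadIdx.x;
    unsigned long long n_pairs = (unsigned long long)N * (N - 1) / 2;
    if (pair >= n_pairs)
        return;

    // invert the triangular numbering pair = i*(i-1)/2 + j to recover (i, j)
    unsigned int i = (unsigned int)((sqrt(8.0 * (double)pair + 1.0) - 1.0) / 2.0);
    unsigned int j = (unsigned int)(pair - (unsigned long long)i * (i + 1) / 2);
    i = i + 1;

    float4 pi = d_pos[i];
    float4 pj = d_pos[j];
    float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
    float rsq = dx * dx + dy * dy + dz * dz;
    if (rsq >= rcut_sq)
        return;

    // standard Lennard-Jones force magnitude divided by r
    float r2inv = 1.0f / rsq;
    float r6inv = r2inv * r2inv * r2inv;
    float fmag_div_r = r2inv * r6inv * (12.0f * lj1 * r6inv - 6.0f * lj2);

    // both particles receive the force; float atomicAdd needs Fermi (sm_20) or newer,
    // and the atomics serialize - that is the price paid for skipping the nlist
    atomicAdd(&d_force[i].x, dx * fmag_div_r);
    atomicAdd(&d_force[i].y, dy * fmag_div_r);
    atomicAdd(&d_force[i].z, dz * fmag_div_r);
    atomicAdd(&d_force[j].x, -dx * fmag_div_r);
    atomicAdd(&d_force[j].y, -dy * fmag_div_r);
    atomicAdd(&d_force[j].z, -dz * fmag_div_r);
}

With N = 1024 that is roughly 524k threads, which is enough to fill the card, but every interaction now pays for atomic accumulation and there is no data reuse, so it only makes sense for tiny systems.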

Well, good luck with that. I've toyed around with this idea but never got past the back-of-the-envelope calculation stage. MD is so serial from step to step that there are few, if any, opportunities for async memcopies, and PCIe bandwidth is so low (relative to on-device bandwidth) that it always slows the computation down. NAMD is the only code I'm aware of that does this well, and they are targeting multi-million particle simulations.

I’d love to see your solution to this problem!