MD - Neighbour List Update: How to re-construct (update) Neighbour Lists on the GPU

Yes, indeed. Check out CellListGPU.cu in the current hoomd trunk for a very fast implementation of cell list generation on the GPU. On G80 cards, going back to the CPU for the cell list was minimal overhead (a few %), but on Fermi the rest of the code got so much faster that the CPU cell list build was approaching 15% of the run time - so a GPU solution was needed.

On Fermi cards, with atomic ops serviced in the L2 cache, it runs very fast. I'm in the process of preparing a follow-up publication on these new algorithms, but for now they are already released and available (and you can always ask me questions about them). The new neighbor list code runs only one particle per thread and relies on the cache hierarchy to read the data in efficiently - the bottleneck in that code is actually the write operation!
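To make the binning step concrete, here is a minimal sketch of the one-particle-per-thread atomic approach. All kernel and variable names are made up for illustration - this is not the actual CellListGPU.cu code, just the general pattern of reserving a slot in each cell with an atomic counter:

__global__ void bin_particles(const float4* __restrict__ d_pos,   // particle positions (xyz)
                              unsigned int* d_cell_size,           // number of particles in each cell
                              unsigned int* d_cell_list,           // [cell][slot] -> particle index
                              unsigned int cell_list_pitch,        // max particles per cell
                              float3 box_lo, float3 cell_width,
                              uint3 cell_dim, unsigned int N)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N)
        return;

    float4 p = d_pos[idx];

    // compute the 3D cell coordinates of this particle (assumes p is inside the box)
    unsigned int ix = (unsigned int)((p.x - box_lo.x) / cell_width.x);
    unsigned int iy = (unsigned int)((p.y - box_lo.y) / cell_width.y);
    unsigned int iz = (unsigned int)((p.z - box_lo.z) / cell_width.z);
    unsigned int cell = (iz * cell_dim.y + iy) * cell_dim.x + ix;

    // atomically reserve a slot in the cell - on Fermi this atomic is serviced in L2
    unsigned int slot = atomicAdd(&d_cell_size[cell], 1u);

    // scattered, largely uncoalesced write into the cell list
    if (slot < cell_list_pitch)
        d_cell_list[cell * cell_list_pitch + slot] = idx;
    // (a real implementation would flag the overflow and rebuild with a larger pitch)
}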

As with many O(N) vs O(N^2) algorithm comparisons, there will likely be a break-even point below which the O(N^2) algorithm is faster for small N. I haven't benchmarked this extensively since the G80 days, but since you ask about a specific system size (1024 particles), I ran a quick benchmark in hoomd of an LJ liquid on a GTX 580.

O(N) nlist

Average TPS: 4434.61

Simulation:    11.3s | 100.0%
        Integrate:     3.3s | 29.1%
                NVT step 1:     0.6s | 5.5%
                        PDATA C2G:     0.0s | 0.2% 419.04 MiB/s
                        Self:          0.6s | 5.3%
                NVT step 2:     0.7s | 5.8%
                Net force:      0.6s | 5.4%
                Thermo:         0.4s | 3.5%
                Self:           1.0s | 9.0%
        Neighbor:      3.6s | 32.1%
                Cell:           0.2s | 1.4%
                        compute:     0.1s | 0.9%
                        init:        0.0s | 0.0%
                        Self:        0.1s | 0.5%
                compute:        2.8s | 24.8%
                dist-check:     0.6s | 5.1%
                Self:           0.1s | 0.9%
        Pair lj:       4.2s | 37.4%
        SFCPack:       0.1s | 0.5%
                PDATA G2C:     0.0s | 0.2% 442.03 MiB/s
                Self:          0.0s | 0.3%
        Self:          0.1s | 0.8%

O(N^2) nlist

Average TPS: 4590.57

Simulation:    10.9s | 100.0%
        Integrate:     3.2s | 29.8%
                NVT step 1:     0.6s | 5.6%
                        PDATA C2G:     0.0s | 0.2% 421.10 MiB/s
                        Self:          0.6s | 5.4%
                NVT step 2:     0.7s | 6.0%
                Net force:      0.6s | 5.4%
                Thermo:         0.4s | 3.5%
                Self:           1.0s | 9.3%
        Neighbor:      3.4s | 31.2%
                compute:        2.7s | 25.1%
                dist-check:     0.6s | 5.2%
                Self:           0.1s | 0.9%
        Pair lj:       4.1s | 37.7%
        SFCPack:       0.1s | 0.5%
                PDATA G2C:     0.0s | 0.2% 443.22 MiB/s
                Self:          0.0s | 0.3%
        Self:          0.1s | 0.8%

Looks like 1024 is near the break-even point here - the two nlist builds take a similar amount of time. You can find this O(N^2) nlist implementation in the hoomd code as well; it uses the standard shared-memory sliding-window technique to reduce the global memory bandwidth demand and keep everything coalesced (sketched below).
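To illustrate what I mean by the sliding window, here is a rough sketch of the pattern (illustrative names only, not the actual hoomd kernel; periodic boundary wrapping is omitted):

__global__ void nlist_nsq(const float4* __restrict__ d_pos,
                          unsigned int* d_n_neigh,     // neighbor count per particle
                          unsigned int* d_nlist,       // [particle][slot] -> neighbor index
                          unsigned int nlist_pitch,    // max neighbors per particle
                          float rmax_sq,               // (r_cut + skin)^2
                          unsigned int N)
{
    extern __shared__ float4 sh_pos[];   // one blockDim.x-wide tile of positions

    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float4 my_pos = (idx < N) ? d_pos[idx] : make_float4(0.0f, 0.0f, 0.0f, 0.0f);
    unsigned int n_neigh = 0;

    // slide a blockDim.x-wide window over all N particles
    for (unsigned int start = 0; start < N; start += blockDim.x)
    {
        // the whole block cooperatively loads one tile, fully coalesced
        unsigned int load = start + threadIdx.x;
        sh_pos[threadIdx.x] = (load < N) ? d_pos[load] : make_float4(0.0f, 0.0f, 0.0f, 0.0f);
        __syncthreads();

        if (idx < N)
        {
            unsigned int tile_n = min(blockDim.x, N - start);
            for (unsigned int j = 0; j < tile_n; j++)
            {
                unsigned int cur = start + j;
                float dx = my_pos.x - sh_pos[j].x;
                float dy = my_pos.y - sh_pos[j].y;
                float dz = my_pos.z - sh_pos[j].z;
                float rsq = dx * dx + dy * dy + dz * dz;
                if (rsq < rmax_sq && cur != idx && n_neigh < nlist_pitch)
                    d_nlist[idx * nlist_pitch + n_neigh++] = cur;
            }
        }
        __syncthreads();
    }

    if (idx < N)
        d_n_neigh[idx] = n_neigh;
}

Launch it with blockDim.x * sizeof(float4) bytes of dynamic shared memory; each position is then read from global memory once per block rather than once per particle.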

The real trouble with running only 1024 particles (with 1024 threads) is that this is nowhere close to fully occupying the GPU - it runs on an MP or two at most. Thus, you can double (or in this case, quadruple!) the system size and most of the kernels take the same amount of time.

Binned nlist for 4096 particles:

Average TPS: 3772.61

Simulation:    13.3s | 100.0%
        Integrate:     3.5s | 26.4%
                NVT step 1:     0.7s | 5.2%
                        PDATA C2G:     0.0s | 0.3%   1.05 GiB/s
                        Self:          0.6s | 4.9%
                NVT step 2:     0.7s | 5.3%
                Net force:      0.6s | 4.8%
                Thermo:         0.4s | 2.9%
                Self:           1.1s | 8.3%
        Neighbor:      4.9s | 37.3%
                Cell:           0.2s | 1.4%
                        compute:     0.1s | 0.9%
                        init:        0.0s | 0.0%
                        Self:        0.1s | 0.5%
                compute:        4.0s | 30.5%
                dist-check:     0.6s | 4.6%
                Self:           0.1s | 0.9%
        Pair lj:       4.5s | 34.1%
        SFCPack:       0.2s | 1.5%
                PDATA G2C:     0.0s | 0.3% 805.03 MiB/s
                Self:          0.2s | 1.2%
        Self:          0.1s | 0.8%

If you are truly interested in running only tiny systems at maximum speed, I'd suggest looking into parallelization strategies other than one particle per thread. RUMD advertises that it is optimized for small systems, and OpenMM claims the same, but I've never looked into the details myself. If I were to go down this road, I'd try running one thread per particle pair for the energy/force evaluation and forget about the neighbor list entirely (see the sketch below). At that point, the worry is that the integration steps become extremely high overhead (as seen in the profiles above).
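For what it's worth, a back-of-the-envelope sketch of the thread-per-pair idea might look like the following. This is purely hypothetical - not code from hoomd, RUMD, or OpenMM - and it skips periodic wrapping and energy/virial accumulation:

__global__ void lj_force_per_pair(const float4* __restrict__ d_pos,
                                  float4* d_force,        // per-particle force accumulators (zeroed beforehand)
                                  float lj1, float lj2,   // 4*eps*sigma^12 and 4*eps*sigma^6
                                  float rcut_sq, unsigned int N)
{
    // one thread per unique pair (i, j) with j < i
    unsigned long long pair = blockIdx.x * (unsigned long long)blockDim.x + threadIdx.x;
    unsigned long long n_pairs = (unsigned long long)N * (N - 1) / 2;
    if (pair >= n_pairs)
        return;

    // invert the triangular numbering pair = i*(i-1)/2 + j to recover (i, j)
    unsigned int i = (unsigned int)((sqrt(8.0 * (double)pair + 1.0) - 1.0) / 2.0);
    unsigned int j = (unsigned int)(pair - (unsigned long long)i * (i + 1) / 2);
    i = i + 1;

    float4 pi = d_pos[i];
    float4 pj = d_pos[j];
    float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
    float rsq = dx * dx + dy * dy + dz * dz;
    if (rsq >= rcut_sq)
        return;

    // standard Lennard-Jones force magnitude divided by r
    float r2inv = 1.0f / rsq;
    float r6inv = r2inv * r2inv * r2inv;
    float fmag_div_r = r2inv * r6inv * (12.0f * lj1 * r6inv - 6.0f * lj2);

    // both particles receive the force; float atomicAdd needs Fermi (sm_20) or newer,
    // and the atomics serialize - that is the price paid for skipping the nlist
    atomicAdd(&d_force[i].x, dx * fmag_div_r);
    atomicAdd(&d_force[i].y, dy * fmag_div_r);
    atomicAdd(&d_force[i].z, dz * fmag_div_r);
    atomicAdd(&d_force[j].x, -dx * fmag_div_r);
    atomicAdd(&d_force[j].y, -dy * fmag_div_r);
    atomicAdd(&d_force[j].z, -dz * fmag_div_r);
}

With N = 1024 that is roughly 524k threads, which is enough to fill the card, but every interaction now pays for atomic accumulation and there is no data reuse, so it only makes sense for tiny systems.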

Well, good luck with that. I've toyed around with this idea but never got past the back-of-the-envelope calculation stage. MD is so serial from step to step that there are few, if any, opportunities for async memcopies, and PCIe bandwidth is so low (relative to on-device bandwidth) that it always slows the computation down. NAMD is the only code I'm aware of that does this well, and they are targeting multi-million particle simulations.

I’d love to see your solution to this problem!