MD - Neighbour List Update: how to re-construct (update) neighbour lists on the GPU

Hi Everyone,

I am working on a different approach to molecular dynamics simulations using CUDA on GPUs. Most of the main parts are already completed:

  • Random set generation - on CPU
  • Neighbor list construction - on CPU
  • Potential calculations - on GPU
  • Neighbor list update - UNKNOWN
  • Velocity Verlet integration - on GPU

In all of the above parts of the program I found a way to provide (almost) coalesced memory access, but I just could not find a way to efficiently map the neighbor list update onto the GPU. In my project context the CPU will be busy doing other things, so I cannot pass the data back to the CPU and have it update the neighbor list. If anyone out there has previously developed MD simulations with CUDA, I would be glad for any advice on how to update the list efficiently. (A code snippet would be very useful, or at least a paper that describes clearly how to do it.)
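For context, the criterion for *when* the list must be updated is the standard Verlet-skin test: build the list with radius r_cut + skin, and rebuild only once some particle has drifted more than skin/2 since the last build. A minimal CPU-side sketch of that check (all names here are illustrative, not from any library):

```cpp
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

// Returns true when any particle has drifted more than skin/2 since the
// last neighbor-list build, i.e. when the list must be reconstructed.
bool needs_rebuild(const std::vector<Vec3>& pos,
                   const std::vector<Vec3>& pos_at_build,
                   double skin) {
    const double limit2 = (skin * 0.5) * (skin * 0.5);
    for (size_t i = 0; i < pos.size(); ++i) {
        double dx = pos[i].x - pos_at_build[i].x;
        double dy = pos[i].y - pos_at_build[i].y;
        double dz = pos[i].z - pos_at_build[i].z;
        if (dx * dx + dy * dy + dz * dz > limit2) return true;
    }
    return false;
}
```

On the GPU this check is just a parallel reduction over per-particle displacements, so it maps well; the hard part is the rebuild itself.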

Thanks a lot,

HOOMD-blue has the fastest neighborlist construction implementation that I am aware of. It's open source; you can get it here: http://codeblue.umich.edu/hoomd-blue/ Look in the files NeighborListBinnedGPU.cu and CellListGPU.cu.

It is a relatively straightforward implementation of the standard cell list -> neighborlist construction typically used on CPUs. The cell list is structured so that the way the neighbor list reads through them produces 95%+ L1 cache hits on Fermi.
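In outline, that cell list -> neighbor list construction looks like the following serial sketch (illustrative names, non-periodic box for brevity - this is not the HOOMD implementation):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Bin particles into a uniform grid with cell width >= r_max, then build
// each particle's neighbor list by scanning only the 27 surrounding cells.
// Non-periodic box [0, L)^3 for brevity.
std::vector<std::vector<int>>
build_nlist(const std::vector<double>& x, const std::vector<double>& y,
            const std::vector<double>& z, double L, double r_max) {
    const int nc = std::max(1, (int)(L / r_max));  // cells per dimension
    const double w = L / nc;                       // cell width >= r_max
    const int N = (int)x.size();
    std::vector<std::vector<int>> cells(nc * nc * nc);
    auto cell_coord = [&](double v) { return std::min(nc - 1, (int)(v / w)); };
    for (int i = 0; i < N; ++i)
        cells[(cell_coord(z[i]) * nc + cell_coord(y[i])) * nc +
              cell_coord(x[i])].push_back(i);

    std::vector<std::vector<int>> nlist(N);
    const double r2 = r_max * r_max;
    for (int i = 0; i < N; ++i) {  // on the GPU: one thread per particle
        int cx = cell_coord(x[i]), cy = cell_coord(y[i]), cz = cell_coord(z[i]);
        for (int dz = -1; dz <= 1; ++dz)
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    int ax = cx + dx, ay = cy + dy, az = cz + dz;
                    if (ax < 0 || ay < 0 || az < 0 ||
                        ax >= nc || ay >= nc || az >= nc)
                        continue;  // no periodic wrap in this sketch
                    for (int j : cells[(az * nc + ay) * nc + ax]) {
                        if (j == i) continue;
                        double ddx = x[i] - x[j], ddy = y[i] - y[j],
                               ddz = z[i] - z[j];
                        if (ddx * ddx + ddy * ddy + ddz * ddz < r2)
                            nlist[i].push_back(j);
                    }
                }
    }
    return nlist;
}
```

The GPU version's performance then hinges on how the cell data is laid out so that those 27 reads hit cache, which is what the HOOMD code is careful about.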

You can also read my paper on the subject ( http://dx.doi.org/10.1016/j.jcp.2008.01.047 ), but keep in mind that it is 3 years old now and was written with G80 hardware in mind! The neighborlist code in particular has been completely rewritten and is now significantly faster. The old version did maintain 100% coalescing on the old hardware, but doing so demanded that the block size always be larger than the largest cell occupancy, which resulted in many wasted threads - especially in simulations with small r_cut, where a typical cell occupancy is only 2-4.

Is there a particular reason that you are rolling your own GPU MD code? There are numerous well-established GPU MD applications in existence already. HOOMD-blue is very general-purpose and may be able to run what you need out of the box.

Thanks, Dr. Anderson. I have already studied your paper and it was an excellent guide to MD on CUDA. But, as you said, the neighbor list construction requires too many wasted threads, which is undesirable in my application context. In addition, it sends the data back to the CPU to do the binning. (I think the newer cards that I have will be able to do this efficiently on the GPU, since they have atomic read-modify-write capability.)

I am thinking that maybe, rather than messing with cells and other data structures (because they produce significant overhead), I could do the O(N^2) calculation directly, even though the accesses won't be coalesced. I will run the MD code with at most 1024 molecules, and I think it may be more profitable to calculate directly rather than building data structures and taking on the overhead of managing them. What do you think?
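Concretely, the brute-force version I have in mind is just the all-pairs double loop, e.g. for the LJ energy (a sketch with illustrative names; the force loop has the same structure, and the outer loop maps onto one GPU thread per particle):

```cpp
#include <cmath>
#include <vector>

// Brute-force O(N^2) Lennard-Jones energy: loop over all pairs each step,
// no cell list or neighbor list at all.
double lj_energy(const std::vector<double>& x, const std::vector<double>& y,
                 const std::vector<double>& z, double eps, double sigma,
                 double rcut) {
    const int N = (int)x.size();
    const double rc2 = rcut * rcut;
    double e = 0.0;
    for (int i = 0; i < N; ++i)
        for (int j = i + 1; j < N; ++j) {
            double dx = x[i] - x[j], dy = y[i] - y[j], dz = z[i] - z[j];
            double r2 = dx * dx + dy * dy + dz * dz;
            if (r2 >= rc2) continue;
            double s6 = std::pow(sigma * sigma / r2, 3.0);  // (sigma/r)^6
            e += 4.0 * eps * (s6 * s6 - s6);
        }
    return e;
}
```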

As for what I am trying to do as my thesis work (which is why I can't just use HOOMD-blue - my supervisor wouldn't like it): I will try to overlap the execution of the CPU and GPU by a planar division of the molecular test space. Basically, I am trying to divide the data as evenly as possible, sending one part to the GPU while I send the other part to the CPU (and possibly to an FPGA or any other computational unit), so that the different computational resources are used with maximum efficiency.
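For the split itself I am thinking of something like choosing the plane by a selection on one coordinate (a rough sketch, names are mine):

```cpp
#include <algorithm>
#include <vector>

// Planar split: pick the x-plane that puts a target fraction of the
// particles on the GPU side, using a selection rather than a full sort.
// Particles with x[i] < plane go to the GPU partition, the rest elsewhere.
double split_plane(std::vector<double> x, double gpu_fraction) {
    size_t k = (size_t)(gpu_fraction * x.size());
    if (k >= x.size()) k = x.size() - 1;
    std::nth_element(x.begin(), x.begin() + k, x.end());
    return x[k];
}
```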

Yes, indeed. Check out CellListGPU.cu in the current hoomd trunk for a very fast implementation of the cell list generation on the GPU. On G80 cards, going back to the CPU was a minimal overhead (~a few %), but on Fermi the rest of the code got so much faster that going back to the CPU for the cell list was approaching 15% of the run time - so a GPU solution was needed.

On Fermi cards with atomic ops in L2 cache, it performs very fast. I’m in the process of preparing a subsequent publication with these new algorithms, but for now they are already released and available (and you can always ask me questions about them). The new neighbor list code now only runs 1 particle per thread and relies on the cache hierarchy to efficiently read in the data - the bottleneck in this code is actually the write operation!
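The atomic fill he describes amounts to each particle claiming a slot in its cell with an atomic increment (atomicAdd on the device). A CPU sketch of the same pattern using std::atomic; the structure and names are illustrative, not HOOMD's:

```cpp
#include <atomic>
#include <vector>

// Flat cell list: each cell holds up to `cap` particle indices plus an
// atomic occupancy counter - the structure a one-thread-per-particle
// CUDA kernel would fill with atomicAdd.
struct FlatCellList {
    std::vector<int> slots;               // ncells * cap particle indices
    std::vector<std::atomic<int>> count;  // per-cell occupancy
    int cap;
    FlatCellList(int ncells, int capacity)
        : slots(ncells * capacity, -1), count(ncells), cap(capacity) {
        for (auto& c : count) c.store(0);
    }
};

// cell_of[i] gives particle i's cell id. Iterations are independent apart
// from the atomic, so this loop maps directly onto a GPU kernel.
void bin_particles(const std::vector<int>& cell_of, FlatCellList& cl) {
    for (int i = 0; i < (int)cell_of.size(); ++i) {
        int slot = cl.count[cell_of[i]].fetch_add(1);  // atomic slot claim
        if (slot < cl.cap)  // on overflow, real codes flag it and re-bin
            cl.slots[cell_of[i] * cl.cap + slot] = i;  // with a larger cap
    }
}
```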

As with many O(N) vs. O(N^2) algorithm comparisons, there will likely be a break-even point below which the O(N^2) algorithm is faster for small N. I haven't extensively benchmarked this since the G80 days, but since you ask about a specific system size (1024 particles), I ran a quick benchmark in hoomd of an LJ liquid on a GTX 580.

O(N) nlist

Average TPS: 4434.61

Simulation:    11.3s | 100.0%
        Integrate:     3.3s | 29.1%
                NVT step 1:     0.6s | 5.5%
                        PDATA C2G:     0.0s | 0.2% 419.04 MiB/s
                        Self:          0.6s | 5.3%
                NVT step 2:     0.7s | 5.8%
                Net force:      0.6s | 5.4%
                Thermo:         0.4s | 3.5%
                Self:           1.0s | 9.0%
        Neighbor:      3.6s | 32.1%
                Cell:           0.2s | 1.4%
                        compute:     0.1s | 0.9%
                        init:        0.0s | 0.0%
                        Self:        0.1s | 0.5%
                compute:        2.8s | 24.8%
                dist-check:     0.6s | 5.1%
                Self:           0.1s | 0.9%
        Pair lj:       4.2s | 37.4%
        SFCPack:       0.1s | 0.5%
                PDATA G2C:     0.0s | 0.2% 442.03 MiB/s
                Self:          0.0s | 0.3%
        Self:          0.1s | 0.8%

O(N^2) nlist

Average TPS: 4590.57

Simulation:    10.9s | 100.0%
        Integrate:     3.2s | 29.8%
                NVT step 1:     0.6s | 5.6%
                        PDATA C2G:     0.0s | 0.2% 421.10 MiB/s
                        Self:          0.6s | 5.4%
                NVT step 2:     0.7s | 6.0%
                Net force:      0.6s | 5.4%
                Thermo:         0.4s | 3.5%
                Self:           1.0s | 9.3%
        Neighbor:      3.4s | 31.2%
                compute:        2.7s | 25.1%
                dist-check:     0.6s | 5.2%
                Self:           0.1s | 0.9%
        Pair lj:       4.1s | 37.7%
        SFCPack:       0.1s | 0.5%
                PDATA G2C:     0.0s | 0.2% 443.22 MiB/s
                Self:          0.0s | 0.3%
        Self:          0.1s | 0.8%

Looks like 1024 is near the break-even point here - the two nlist builds take a similar amount of time. You can find this nlist implementation in the hoomd code as well; it uses the standard shmem sliding-window technique to reduce global memory bandwidth demand and keep everything coalesced.
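For readers unfamiliar with the technique: the shmem sliding window processes positions in tiles - each block stages one tile of positions into shared memory, every thread sweeps that tile, then the window slides to the next tile. A CPU model of that access pattern (TILE and the function name are illustrative; this is not hoomd's code):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

constexpr int TILE = 64;  // shmem tile size (= block size on the GPU)

// Counts each particle's neighbors within rcut by sweeping the positions
// one TILE-sized window at a time, mirroring the coalesced tile loads a
// GPU block would do into shared memory.
std::vector<int> count_neighbors(const std::vector<double>& x,
                                 const std::vector<double>& y,
                                 const std::vector<double>& z, double rcut) {
    const int N = (int)x.size();
    const double r2 = rcut * rcut;
    std::vector<int> nn(N, 0);
    for (int base = 0; base < N; base += TILE) {  // slide the window
        int end = std::min(base + TILE, N);
        // On the GPU, threads cooperatively load positions [base, end)
        // into shared memory here, then __syncthreads().
        for (int i = 0; i < N; ++i)        // one thread per particle i
            for (int j = base; j < end; ++j) {  // sweep the staged tile
                if (j == i) continue;
                double dx = x[i] - x[j], dy = y[i] - y[j], dz = z[i] - z[j];
                if (dx * dx + dy * dy + dz * dz < r2) ++nn[i];
            }
    }
    return nn;
}
```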

The real trouble with running only 1024 particles (with 1024 threads) is that this is nowhere close to fully occupying the GPU - at most it runs on an MP or two. Thus, you can double (or in this case, quadruple!) the system size and most of the kernels take the same amount of time.

Binned nlist for 4096 particles:

Average TPS: 3772.61

Simulation:    13.3s | 100.0%
        Integrate:     3.5s | 26.4%
                NVT step 1:     0.7s | 5.2%
                        PDATA C2G:     0.0s | 0.3%   1.05 GiB/s
                        Self:          0.6s | 4.9%
                NVT step 2:     0.7s | 5.3%
                Net force:      0.6s | 4.8%
                Thermo:         0.4s | 2.9%
                Self:           1.1s | 8.3%
        Neighbor:      4.9s | 37.3%
                Cell:           0.2s | 1.4%
                        compute:     0.1s | 0.9%
                        init:        0.0s | 0.0%
                        Self:        0.1s | 0.5%
                compute:        4.0s | 30.5%
                dist-check:     0.6s | 4.6%
                Self:           0.1s | 0.9%
        Pair lj:       4.5s | 34.1%
        SFCPack:       0.2s | 1.5%
                PDATA G2C:     0.0s | 0.3% 805.03 MiB/s
                Self:          0.2s | 1.2%
        Self:          0.1s | 0.8%

If you are truly interested in running only tiny systems, I'd suggest looking into parallelization strategies other than one particle per thread. RUMD advertises that it is optimized for small systems, and OpenMM claims the same; I've never looked into the details myself. If I were to go down this road, I'd try running one thread per particle pair for the energy/force evaluation and forget about the neighbor list entirely. At that point, the worry is that your integration steps become extremely high overhead (as seen in the profiles).
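For the one-thread-per-pair idea, the main piece is mapping a flat thread index k in [0, N(N-1)/2) back to its pair (i, j) with i < j, so the whole O(N^2) interaction set fills the GPU even for tiny N. One closed-form way to do it, with a fixup for floating-point rounding (a sketch, not taken from any existing code):

```cpp
#include <cmath>
#include <utility>

// Maps flat pair index k in [0, N*(N-1)/2) to the pair (i, j), i < j.
// Pairs are ordered (0,1), (0,2), ..., (0,N-1), (1,2), ... so the number
// of pairs whose first index is < i is offset(i) = i*(2N - i - 1)/2;
// solving offset(i) <= k for the largest i gives a quadratic in i.
std::pair<int, int> pair_from_index(long long k, int N) {
    double b = 2.0 * N - 1.0;
    int i = (int)((b - std::sqrt(b * b - 8.0 * (double)k)) / 2.0);
    auto offset = [N](long long t) { return t * (2LL * N - t - 1) / 2; };
    while (offset(i + 1) <= k) ++i;  // guard against sqrt rounding down
    while (offset(i) > k) --i;       // guard against sqrt rounding up
    int j = (int)(i + 1 + (k - offset(i)));
    return {i, j};
}
```

Each thread would then decode its pair, compute the interaction, and accumulate the force on both particles (which needs atomics or a segmented reduction, since many pairs share a particle).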

Well, good luck with that. I’ve toyed around with this idea but never got past the back-of-the-envelope calculation stage. MD is so serial from step to step that there are few, if any, opportunities for async memcopies and the PCIe bandwidth is so slow (relative to on-device) that it always slows the computation down. NAMD is the only code I’m aware of that does this well, and they are targeting multi-million particle simulations.

I’d love to see your solution to this problem!

Yes - I was actually thinking about the occupancy issue, so I think I will increase the number of molecules to at least 4096 (or to 8000). In that case I am going to use the cell list approach.

Certainly - I will gladly share it with you once I have completely made up my mind on the issue. :)
Thanks for your support, it was really helpful.