Molecular dynamics: how to speed up molecular modeling on the GPU?

I’d like to discuss the possibilities of reaching high performance in molecular dynamics with CUDA technology. Astronomers report a two-order-of-magnitude speedup for their tasks. Unfortunately, in molecular calculations the situation is not so good. The problems originate from the fact that not all pairs of particles should be calculated. For example, particles outside the cutoff radius should be ignored, and covalently bound atoms should be ignored.

The usual solution on the CPU is to use pair lists. But working with pair lists on the GPU is very slow due to slow memory access.
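For concreteness, the standard CPU approach can be sketched in plain Python. This is a generic cell-list illustration of my own (names and structure are not from any of the packages mentioned): bin atoms into cells at least one cutoff wide, then only neighboring cells need to be scanned for pairs.

```python
def build_pair_list(positions, box, cutoff):
    """Bin particles into cells at least `cutoff` wide, then scan only
    neighboring cells for pairs within the cutoff -- O(N) rather than O(N^2)."""
    ncell = max(1, int(box // cutoff))   # cells per box side
    cell_size = box / ncell
    cells = {}                           # (ix, iy, iz) -> list of particle indices
    for idx, (x, y, z) in enumerate(positions):
        key = (int(x / cell_size) % ncell,
               int(y / cell_size) % ncell,
               int(z / cell_size) % ncell)
        cells.setdefault(key, []).append(idx)

    pairs = []
    cut2 = cutoff * cutoff
    for (cx, cy, cz), members in cells.items():
        # unique neighbor-cell keys (a set, so tiny boxes don't double-count)
        nkeys = {((cx + dx) % ncell, (cy + dy) % ncell, (cz + dz) % ncell)
                 for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)}
        for nkey in nkeys:
            for i in members:
                for j in cells.get(nkey, []):
                    if j <= i:
                        continue         # count each pair once
                    rx = positions[i][0] - positions[j][0]
                    ry = positions[i][1] - positions[j][1]
                    rz = positions[i][2] - positions[j][2]
                    # minimum-image convention for the periodic box
                    rx -= box * round(rx / box)
                    ry -= box * round(ry / box)
                    rz -= box * round(rz / box)
                    if rx * rx + ry * ry + rz * rz < cut2:
                        pairs.append((i, j))
    return pairs
```

On the CPU the scattered `cells.get(...)` lookups are cheap thanks to large caches; on the GPU the same random access pattern is exactly what makes naive pair lists slow.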

There are some programs that perform molecular dynamics calculations, but it seems to me that all of them are problematic:

NAMD has a very interesting algorithm, but it does not look universal; it seems appropriate only for polymeric molecules. Moreover, it is cell-based rather than pair-list based.

Ascalaph achieves only an 18x speedup (no pair lists).

HOOMD claims 15x.

Folding@home reports a 50x speedup, but their code is not available.

Are there any ideas on how to implement something like pair lists on the video card and reach a proportional speedup?

Where did you get 15x from? The speedup is roughly 30x for an 8800 GTX vs. one core of an Opteron 285 when running a standard Lennard-Jones liquid…

Still, in the end, comparing “speedups” between various apps means little. First, different authors use different benchmark systems and different baselines from which the speedup is measured. Absolute performance measured as the amount of work done per second on the same benchmark system is what matters. Though, I don’t know that this comparison can be done for all the MD codes you listed since they all target different types of simulations. (NAMD->biomolecules, HOOMD->general purpose with an emphasis on coarse-grained, Ascalaph -> ??, Folding@home -> proteins).

And Folding@home doesn’t use pair lists either. They do the full N^2 force sum, though they have a trick for implementing Newton’s third law. O(1/2 N^2) is still O(N^2) as far as scaling goes, though.
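Folding@home’s actual kernels aren’t public, so the following is only a generic Python illustration of the Newton’s-third-law half-sum idea: each pair force is evaluated once and applied with opposite signs to both atoms, halving the pair evaluations while remaining O(N^2).

```python
def lj_forces_half_sum(positions, epsilon=1.0, sigma=1.0):
    """Full N^2 Lennard-Jones force sum, but each pair is evaluated once:
    F_ji = -F_ij (Newton's third law) is applied to both atoms."""
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):        # j > i: only half the pairs
            dx = positions[i][0] - positions[j][0]
            dy = positions[i][1] - positions[j][1]
            dz = positions[i][2] - positions[j][2]
            r2 = dx * dx + dy * dy + dz * dz
            sr2 = sigma * sigma / r2
            sr6 = sr2 ** 3
            # -(dU/dr)/r for the 12-6 potential U = 4*eps*((s/r)^12 - (s/r)^6)
            f_over_r = 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r2
            for k, d in enumerate((dx, dy, dz)):
                forces[i][k] += f_over_r * d    # force on i
                forces[j][k] -= f_over_r * d    # equal and opposite on j
    return forces
```

Note that on a GPU the `forces[j][k] -=` write is the hard part: two threads computing different pairs may try to update the same atom, which is why exploiting the third law there requires a trick.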

As far as pair lists are concerned: HOOMD is the only MD code I know of that generates pair lists on the GPU. (Well, there is one recently published paper where they also calculated the pair list, but it used an O(N^2) algorithm for that step, so it was very slow.) The current development version of HOOMD has pair list generation running ~50% faster than the version in the paper. Both the pair list generation and the pair force sum now hit device memory bandwidth limitations => they can’t go any faster.

At least, they can’t go any faster without better caching (hint hint, NVIDIA: make bigger texture caches!). Shared memory could be used, but then you need a much more regular access pattern, which brings you right back to the cell-type data structures that NAMD uses, which have their own headaches. My attempts at working with those have yielded ~1/3 the performance of pair lists as implemented in HOOMD.

Anyway, to summarize a really long post: you got it right when you said

With the astro simulations, they have very little memory access and lots of FLOPS, so they can leverage a speedup of FLOPS_gpu / FLOPS_cpu = ~100 in an ideal world. O(N) MD with pair lists (or cell lists) performs relatively few FLOPS per memory read, and thus the performance is bounded by memory. In an ideal world, the fastest speedup a GPU MD can achieve is GPU_bandwidth / CPU_bandwidth = 86.4 / 6.4 = 13.5. <— Since HOOMD gets 30x, that just goes to show that the CPU-based MDs aren’t using memory at the full bandwidth available (because of the random access pattern). By this simple argument, memory-bound calculations on the GPU will always achieve an order of magnitude less speedup than FLOPS-bound calculations.
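Just to make the arithmetic in that bound explicit (the numbers are the ones quoted above: 86.4 GB/s device memory bandwidth on the 8800 GTX vs. 6.4 GB/s on the CPU side):

```python
gpu_bw = 86.4            # GB/s, 8800 GTX device memory bandwidth
cpu_bw = 6.4             # GB/s, contemporary CPU memory bus
flops_ratio = 100.0      # rough FLOPS_gpu / FLOPS_cpu for compute-bound codes

bw_bound_speedup = gpu_bw / cpu_bw   # ~13.5x, the memory-bound ceiling
print(bw_bound_speedup, flops_ratio / bw_bound_speedup)
```

The second number (~7x) is how much more headroom a FLOPS-bound code has over a bandwidth-bound one under these idealized peak figures.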

If you have any questions on the particulars of the pair lists in HOOMD, I’d be happy to answer.

Thank you for such a detailed answer.

I took it from …30-dfc5b74f986c

Could you give a reference for it?

Thank you very much. I will look at the HOOMD code first.

Hmm, I’ll have to have them update the CUDA Zone speedup number. They list our paper with a speedup of 30x, but the HOOMD software has a separate entry at 15x. Odd.

The only reference I’m aware of that mentions the Folding@Home CUDA implementation is
“Parallel Computing Experiences With CUDA”
Michael Garland, et al.
to be published in IEEE Micro

It is apparently not published yet; I didn’t realize that. It should be in the next issue.

What I see at first glance in HOOMD is that everything is implemented in CUDA. That largely explains the 30x. Ascalaph calculates only the van der Waals terms on CUDA. I have not studied the code in detail yet. If you like, we can continue the discussion over e-mail.
My address is on the site of

HOOMD’s lj_liquid benchmark page

says that a GTX 680 can compute one timestep (double precision) in just 5 milliseconds (1e7 timesteps in 15 hours), and this is with 64,000 atoms, cutoff = 3.0 (nanometers), sigma = 1.0 (nanometers).

A GTX 680 has (1536 cores) * (1/24 DP/SP ratio) * (2 flops per FMA) * (1 GHz) * (50/100 efficiency) = 64 GFLOP/s.

Assuming each pair force calculation takes 16 flops, each timestep needs

(64,000 atoms) * (16 flops) * (2000 neighbors per atom) = 2 GFLOP total

2 GFLOP / 64 GFLOP/s = 31 milliseconds, but the benchmark says it should be 5 milliseconds (1e7 timesteps in 15 hours).
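Redoing that arithmetic in Python (all inputs are the assumptions stated above, not measured values):

```python
# GTX 680 double-precision throughput estimate (all figures assumed as above)
cores = 1536
dp_sp_ratio = 1.0 / 24.0     # DP throughput as a fraction of SP on GK104
flops_per_fma = 2
clock_ghz = 1.0
efficiency = 0.5             # assumed 50% of peak

gflops = cores * dp_sp_ratio * flops_per_fma * clock_ghz * efficiency  # 64 GFLOP/s

# Work per timestep under the assumed neighbor count
atoms = 64_000
flops_per_pair = 16
neighbors = 2000
work_gflop = atoms * flops_per_pair * neighbors / 1e9   # ~2.05 GFLOP

time_ms = work_gflop / gflops * 1000.0
print(gflops, time_ms)   # ~32 ms under these assumptions, vs. the benchmarked ~5 ms
```

So the ~6x gap between the estimate and the benchmark has to come from one of the assumed inputs being off.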

So at least one of these must be true (to go from 31 milliseconds to 5 milliseconds):

  • the cutoff does not actually span that many atoms (2000 neighbors) when it is 3.0 (nanometers), but far fewer, say 500 neighbors per atom (when each atom’s Rmin, the distance at the potential minimum / equilibrium, is 0.2 nanometers, that gives 2000 marbles per 3D cutoff scan)?
  • each Lennard-Jones operation takes fewer than 16 flops, maybe only 3-4 (??? even the distance between two atoms alone is at least 3 flops)?
  • the algorithm doesn’t compute every timestep (are atoms assumed to be stationary when dt is small?), in which case does the benchmark measure only neighbor-list build times?
  • the GTX 680 somehow gets a 300% efficiency boost when doing double precision, making it effectively a 1/4 DP/SP ratio (and looking at the SP benchmark results, DP is indeed just a bit below 1/4 of SP)
  • it is really memory-bottlenecked, so enabling single-precision floats doesn’t unlock much performance (does this imply the 1st option too? A high neighbor count would make it more compute-bottlenecked, wouldn’t it?)
  • or a bit of all the options above

Maybe those marbles on the benchmark page represent the cutoff range rather than the distance at minimum potential or sigma? So that there are only ~26 neighbors, and it is memory-bottlenecked?
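One way to check the first bullet: at a typical reduced density for a Lennard-Jones liquid, ρ* ≈ 0.8 atoms/σ³ (my assumption; the benchmark page may use a different density), a cutoff sphere of radius 3σ encloses far fewer than 2000 neighbors:

```python
import math

density = 0.8    # assumed reduced density, atoms per sigma^3 (typical LJ liquid)
cutoff = 3.0     # cutoff radius in units of sigma

# average number of atoms inside the cutoff sphere
neighbors = density * (4.0 / 3.0) * math.pi * cutoff ** 3
print(round(neighbors))   # ~90 neighbors per atom, not 2000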