Molecular dynamics: How to speed up molecular modeling on the GPU?

I’d like to discuss the possibilities of reaching high performance in molecular dynamics with CUDA. Astronomers report a two-order-of-magnitude speedup for their tasks. Unfortunately, the situation in molecular calculations is not so good. The problems originate from the fact that not all pairs of particles should be calculated: for example, particles outside the cutoff radius should be ignored, and so should covalently bonded atoms.

The usual approach on the CPU is to use pair lists, but working with a pair list on the GPU is very slow due to slow memory access.
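
By pair lists I mean the usual Verlet neighbor list: for every particle, the indices of all particles within the cutoff plus a skin, rebuilt every few steps. A schematic CPU version of what I have in mind (the names build_pair_list, is_excluded and the skin parameter are illustrative only, not from any particular code):

    // Schematic CPU-side Verlet pair-list build (O(N^2) construction for clarity).
    #include <vector>

    struct Vec3 { float x, y, z; };

    void build_pair_list(const std::vector<Vec3>& pos,
                         float r_cut, float skin,
                         bool (*is_excluded)(int, int),         // e.g. covalently bonded pairs
                         std::vector<std::vector<int>>& nlist)  // nlist[i] = neighbours of i
    {
        const float r_list2 = (r_cut + skin) * (r_cut + skin);
        const int n = (int)pos.size();
        nlist.assign(n, {});
        for (int i = 0; i < n; ++i)
            for (int j = i + 1; j < n; ++j) {
                if (is_excluded(i, j)) continue;                // skip bonded atoms
                const float dx = pos[i].x - pos[j].x;
                const float dy = pos[i].y - pos[j].y;
                const float dz = pos[i].z - pos[j].z;
                if (dx*dx + dy*dy + dz*dz < r_list2)
                    nlist[i].push_back(j);                      // half list: store j > i only
            }
    }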

There are some programs that do molecular dynamics calculations on the GPU, but it seems to me that all of them are problematic:

NAMD (http://www.ks.uiuc.edu/Research/gpu/) has a very interesting algorithm, but it does not look universal; it seems appropriate only for polymeric molecules. Moreover, it is cell based rather than pair-list based.

Ascalaph (http://www.agilemolecule.com/Ascalaph/Ascalaph-Liquid.html) achieves only an 18x speedup (no pair lists).

HOOMD (http://www.external.ameslab.gov/hoomd/download_real.html) claims 15x.

Folding@home (http://folding.typepad.com/news/) reports a 50x speedup, but their code is not available.

Are there any ideas on how to implement something like pair lists on the video card and reach a proportional speedup?

Where did you get 15x from? The speedup is roughly 30x for an 8800 GTX vs one core of an Opteron 285 when running a standard Lennard-Jones liquid…

Still, in the end, comparing “speedups” between various apps means little. First, different authors use different benchmark systems and different baselines from which the speedup is measured. Absolute performance measured as the amount of work done per second on the same benchmark system is what matters. Though, I don’t know that this comparison can be done for all the MD codes you listed since they all target different types of simulations. (NAMD->biomolecules, HOOMD->general purpose with an emphasis on coarse-grained, Ascalaph -> ??, Folding@home -> proteins).

And Folding@home doesn’t use pair lists either. They do the full N^2 force sum, though they have a trick for implementing Newton’s third law. O(1/2 N^2) is still O(N^2) as far as scaling goes, though.
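
The third-law part is just the usual half-matrix update: visit each pair once and accumulate +f on one atom and -f on the other. A generic sketch of the idea (not their actual code; the 12-6 Lennard-Jones force is only an example potential):

    // Half-matrix N^2 force sum: each pair is visited once and the force is applied
    // with opposite signs to both atoms (Newton's third law).
    #include <vector>

    struct Vec3 { float x, y, z; };

    void lj_forces_n2(const std::vector<Vec3>& pos, float eps, float sigma,
                      std::vector<Vec3>& force)
    {
        const int n = (int)pos.size();
        force.assign(n, Vec3{0.f, 0.f, 0.f});
        const float s2 = sigma * sigma;
        for (int i = 0; i < n; ++i)
            for (int j = i + 1; j < n; ++j) {                  // each pair once: ~N^2/2 iterations
                const float dx = pos[i].x - pos[j].x;
                const float dy = pos[i].y - pos[j].y;
                const float dz = pos[i].z - pos[j].z;
                const float r2  = dx*dx + dy*dy + dz*dz;
                const float sr2 = s2 / r2;
                const float sr6 = sr2 * sr2 * sr2;
                const float fr  = 24.f * eps * sr6 * (2.f * sr6 - 1.f) / r2;  // |F| / r
                force[i].x += fr * dx;  force[i].y += fr * dy;  force[i].z += fr * dz;
                force[j].x -= fr * dx;  force[j].y -= fr * dy;  force[j].z -= fr * dz;
            }
    }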

As far as pair lists are concerned: HOOMD is the only MD code I know of that generates pair lists on the GPU. (Well, there is one recently published paper where they also calculated the pair list, but it used an O(N^2) algorithm for that step, so it was very slow.) The current development version of HOOMD has pair-list generation running ~50% faster than the version in the paper. Both the pair-list generation and the pair force sum now hit device memory bandwidth limitations => they can’t go any faster.
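
In outline, the cell-list approach to building the list on the GPU is: bin the particles into cells of width ~r_cut, then have one thread per particle scan its own cell and the 26 surrounding ones. A simplified CUDA sketch of that idea (the data layout and names are made up for illustration and this is not the actual HOOMD kernel; particles are assumed sorted by cell, inside the box, and periodic imaging is omitted):

    // One thread per particle: scan the up-to-27 cells around the particle's own cell
    // and record neighbours within the list radius.
    __global__ void build_nlist(const float4* __restrict__ pos,        // sorted by cell
                                const int*    __restrict__ cell_start, // first particle in each cell
                                const int*    __restrict__ cell_count, // particles per cell
                                int3 ncell, float cell_w, float r_list2,
                                int* nlist, int* n_neigh, int max_neigh, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;

        float4 pi = pos[i];
        int3 ci = make_int3((int)(pi.x / cell_w), (int)(pi.y / cell_w), (int)(pi.z / cell_w));
        int count = 0;

        for (int dz = -1; dz <= 1; ++dz)
        for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int3 c = make_int3(ci.x + dx, ci.y + dy, ci.z + dz);
            if (c.x < 0 || c.y < 0 || c.z < 0 ||
                c.x >= ncell.x || c.y >= ncell.y || c.z >= ncell.z)
                continue;                                      // non-periodic box for brevity
            int cell  = (c.z * ncell.y + c.y) * ncell.x + c.x;
            int start = cell_start[cell];
            for (int k = 0; k < cell_count[cell]; ++k) {
                int j = start + k;
                if (j == i) continue;
                float4 pj = pos[j];
                float rx = pi.x - pj.x, ry = pi.y - pj.y, rz = pi.z - pj.z;
                if (rx*rx + ry*ry + rz*rz < r_list2 && count < max_neigh)
                    nlist[i * max_neigh + count++] = j;        // row-per-particle layout
            }
        }
        n_neigh[i] = count;
    }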

At least they can’t go any faster without better caching (hint hint NVIDIA: make bigger texture caches!). Shared memory could be used, but then you need a much more regular access pattern, which brings you right back to the cell-type data structures that NAMD uses, and those have their own headaches. My attempts at working with that have yielded ~1/3 the performance of pair lists as implemented in HOOMD.

Anyway, to summarize a really long post: you got it right when you said

With the astro simulations, they have very little memory access and lots of FLOPS, so they can leverage a speedup of FLOPS_gpu / FLOPS_cpu = ~100 in an ideal world. O(N) MD with pair lists (or cell lists) performs relatively few FLOPS for each memory read, and thus the performance is bounded by memory. In an ideal world the fastest speedup a GPU MD can achieve is GPU_bandwidth / CPU_bandwidth = 86.4 / 6.4 = 13.5. <— Since HOOMD gets 30x, that just goes to show that the CPU-based MDs aren’t using memory at the full bandwidth available (because of the random access pattern). By this simple argument, memory-bound calculations on the GPU will always see an order of magnitude less speedup than FLOP-bound calculations.
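
To make the “few FLOPS per read” point concrete, here is roughly what the inner loop of a pair-list force kernel looks like (a simplified sketch, not the actual HOOMD kernel; names like lj_force_nlist, nlist and max_neigh are illustrative). Every neighbour costs a scattered 16-byte float4 read for only a few dozen flops, so the loop is limited by bandwidth, not arithmetic.

    // Simplified pair-list Lennard-Jones force kernel, one thread per particle.
    __global__ void lj_force_nlist(const float4* __restrict__ pos,
                                   const int*    __restrict__ nlist,
                                   const int*    __restrict__ n_neigh,
                                   int max_neigh, float eps, float sigma2, float r_cut2,
                                   float4* force, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;

        float4 pi = pos[i];
        float fx = 0.f, fy = 0.f, fz = 0.f;

        for (int k = 0; k < n_neigh[i]; ++k) {
            int j = nlist[i * max_neigh + k];
            float4 pj = pos[j];                    // the expensive, essentially random read
            float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
            float r2 = dx*dx + dy*dy + dz*dz;
            if (r2 < r_cut2) {
                float sr2 = sigma2 / r2;
                float sr6 = sr2 * sr2 * sr2;
                float fr  = 24.f * eps * sr6 * (2.f * sr6 - 1.f) / r2;
                fx += fr * dx;  fy += fr * dy;  fz += fr * dz;
            }
        }
        force[i] = make_float4(fx, fy, fz, 0.f);
    }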

If you have any questions on the particulars of the pair lists in HOOMD, I’d be happy to answer.

Thank you for such a detailed answer.

I took it from http://www.nvidia.com/object/cuda_home.htm…30-dfc5b74f986c

Could you give a reference for it?

Thank you very much. I will look at the HOOMD code first.

Hmm, I’ll have to have them update the CUDAZone speedup number. They list our paper as a speedup of 30x, but the HOOMD software has a separate entry with 15x, odd.

The only reference I’m aware of that mentions the Folding@Home CUDA implementation is
“Parallel Computing Experiences With CUDA”
Michael Garland, et al.
to be published in IEEE Micro

Which is apparently not published yet; I didn’t realize that. It should be in the next issue.

What I see at first glance in HOOMD is that everything is implemented in CUDA. That largely explains the 30x. Ascalaph calculates only the van der Waals part on CUDA. I have not studied the code in detail yet. If you like, we can carry the discussion on by e-mail.
My address is on the site at http://www.agilemolecule.com/Contacts.html

HOOMD’s lj_liquid benchmark page

says that a GTX 680 can compute 1 timestep (double precision) in just 5 milliseconds (10e6 timesteps in 15 hours), and this is with 64000 atoms, cutoff = 3.0 (nanometers) and sigma = 1.0 (nanometers).

A GTX 680 has (1536 cores) * (1/24 DP/SP ratio) * (2 flops per FMA) * (1 GHz) * (50/100 efficiency) = 64 Gflop per second in double precision.

Assuming each pair force calculation takes 16 flops, each timestep needs

(64000 atoms) * (16 flops) * (2000 neighbors per atom) ≈ 2 Gflop total

2 Gflop / 64 Gflop/s ≈ 31 milliseconds, but the benchmark says it should be 5 milliseconds (10e6 timesteps in 15 hours).
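
The same estimate spelled out, in case I slipped somewhere (every constant here is one of my assumptions from above):

    // Back-of-the-envelope timestep estimate; all inputs are assumptions, not measurements.
    #include <cstdio>

    int main() {
        const double atoms      = 64000;   // benchmark system size
        const double neighbours = 2000;    // assumed neighbours per atom at cutoff = 3.0
        const double flops_pair = 16;      // assumed flops per Lennard-Jones pair
        const double dp_flops   = 64e9;    // assumed GTX 680 double-precision rate (flop/s)

        const double work_per_step = atoms * neighbours * flops_pair;     // ~2.0e9 flop
        printf("%.0f ms per timestep\n", work_per_step / dp_flops * 1e3); // ~32 ms vs ~5 ms on the benchmark page
        return 0;
    }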

So at least one of these must be true (to go from 31 milliseconds down to 5 milliseconds):

  • The cutoff does not actually span that many atoms (2000 neighbors) when it is 3.0 (nanometers), but far fewer, e.g. 500 neighbor atoms per atom (if each atom’s Rmin, the distance at the potential minimum / equilibrium, is 0.2 nanometers, that gives about 2000 marbles per 3D cutoff scan)?
  • Each Lennard-Jones pair calculation is less than 16 flops, maybe only 3-4 flops (??? even the distance between two atoms alone is 3 flops)?
  • The algorithm doesn’t compute every timestep (?? are atoms assumed to be stationary when dt is small?). In that case, does the benchmark only measure neighbor-list building times?
  • The GTX 680 somehow gets a 300% efficiency boost when doing double precision, so it is effectively a 1/4 DP/SP ratio. Also, looking at the SP benchmark results, the DP result is just a bit lower than 1/4 of it.
  • It is really memory bottlenecked, so enabling single-precision floats doesn’t unlock much more performance (does this imply the 1st option too? Because a high neighbor count would make it more compute bottlenecked, wouldn’t it?)
  • A bit of all of the options above combined.

Maybe the marbles on the benchmark page represent the cutoff range rather than r_min (the potential minimum) or sigma? So that there are only 26 neighbors and it is memory bottlenecked?