Where did you get 15x from? The speedup is roughly 30x for an 8800 GTX vs. one core of an Opteron 285 when running a standard Lennard-Jones liquid…
Still, in the end, comparing “speedups” between various apps means little. For one, different authors use different benchmark systems and different baselines from which the speedup is measured. Absolute performance, measured as the amount of work done per second on the same benchmark system, is what matters. Though, I don’t know that this comparison can be done for all the MD codes you listed, since they all target different types of simulations. (NAMD->biomolecules, HOOMD->general purpose with an emphasis on coarse-grained, Ascalaph -> ??, Folding@home -> proteins).
And Folding@home doesn’t use pair lists either. They do the full N^2 force sum, though they have a trick for implementing Newton’s third law. O(1/2 N^2) is still O(N^2) as far as scaling goes, though.
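To make the scaling point concrete, here is a minimal sketch of an all-pairs Lennard-Jones force sum that visits each pair once and applies Newton’s third law (f_ji = -f_ij). This halves the work to N(N-1)/2 pair evaluations but remains O(N^2); it is an illustration only, not Folding@home’s actual kernel.

```python
import numpy as np

def lj_forces_n2(pos, eps=1.0, sigma=1.0):
    """All-pairs Lennard-Jones forces, each pair evaluated once.

    Newton's third law trick: compute f_ij for i < j only, then
    apply +f to particle i and -f to particle j. This is half the
    pair evaluations of the naive double loop, but still O(N^2).
    """
    n = len(pos)
    forces = np.zeros_like(pos)
    for i in range(n):
        for j in range(i + 1, n):            # N(N-1)/2 pairs total
            r = pos[i] - pos[j]
            r2 = np.dot(r, r)
            sr6 = (sigma * sigma / r2) ** 3  # (sigma/r)^6, no sqrt needed
            # F = 24*eps/r^2 * (2*(sigma/r)^12 - (sigma/r)^6) * r_vec
            f = 24.0 * eps / r2 * sr6 * (2.0 * sr6 - 1.0) * r
            forces[i] += f                   # action ...
            forces[j] -= f                   # ... and reaction, for free
    return forces
```

A quick sanity check: two particles at the potential minimum r = 2^(1/6)·sigma feel zero force, and the total momentum change is always zero because every +f is paired with a -f.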
As far as pair lists are concerned: HOOMD is the only MD code I know of that generates pair lists on the GPU. (Well, there is one recently published paper where they also calculated the pair list on the GPU, but it used an O(N^2) algorithm for that step, so it was very slow.) The current development version of HOOMD generates the pair list ~50% faster than the version in the paper. Both the pair list generation and the pair force sum now hit device memory bandwidth limitations => they can’t go any faster.
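For readers unfamiliar with how an O(N) pair list build works, here is a sketch of the standard cell-binning approach: bin particles into cells at least r_cut wide, then search only the 27 neighboring cells of each particle instead of all N candidates. This is the generic CPU-style algorithm, not HOOMD’s GPU kernel; the function name and parameters are my own for illustration.

```python
import numpy as np
from collections import defaultdict
from itertools import product

def build_pair_list(pos, box, r_cut):
    """O(N) pair list via cell binning, cubic periodic box of edge `box`.

    Each cell is at least r_cut wide, so all pairs within r_cut are
    guaranteed to lie in the same or adjacent cells (27 cells in 3D).
    """
    n_cells = max(1, int(box // r_cut))       # cell edge >= r_cut
    cell_w = box / n_cells
    cells = defaultdict(list)
    for i, p in enumerate(pos):               # O(N) binning pass
        cells[tuple((p // cell_w).astype(int) % n_cells)].append(i)

    pairs = set()                             # set handles tiny boxes (n_cells < 3)
    r_cut2 = r_cut * r_cut
    for i, p in enumerate(pos):
        ci = (p // cell_w).astype(int)
        for off in product((-1, 0, 1), repeat=3):   # 27 neighboring cells
            for j in cells[tuple((ci + np.array(off)) % n_cells)]:
                if j <= i:
                    continue                  # count each pair once
                d = pos[i] - pos[j]
                d -= box * np.round(d / box)  # minimum-image convention
                if np.dot(d, d) < r_cut2:
                    pairs.add((i, j))
    return sorted(pairs)
```

With a roughly uniform density, each particle checks O(1) candidates, so the whole build is O(N) — the property that makes it worth porting to the GPU in the first place.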
At least they can’t go any faster without better caching (hint hint NVIDIA: make bigger texture caches!). Shared memory could be used, but then you need a much more regular access pattern, which brings you right back to the cell-type data structures that NAMD uses, which have their own headaches. My attempts at working with those have yielded ~1/3 the performance of pair lists as implemented in HOOMD.
Anyway, to summarize a really long post: you got it right when you said
With the astro simulations, they have very little memory access and lots of FLOPS: so they can leverage a speedup of FLOPS_gpu / FLOPS_cpu = ~100 in an ideal world. O(N) MD with pair lists (or cell lists) performs relatively few FLOPS for each memory read, and thus the performance is bounded by memory. In an ideal world, the fastest speedup a GPU MD can achieve is GPU_bandwidth / CPU_bandwidth = 86.4 / 6.4 = 13.5. <— Since HOOMD gets 30x, that just goes to show that the CPU-based MD codes aren’t using memory at the full bandwidth available (because of the random access pattern). By this simple argument, memory-bound calculations on the GPU will always see an order of magnitude less speedup than FLOPS-bound calculations.
If you have any questions on the particulars of the pair lists in HOOMD, I’d be happy to answer.