I don’t know how they do the marketing numbers, but if you look at this thread (http://forums.nvidia.com/index.php?showtopic=69801), you will see performance figures for an optimized FFT that are as high as 298 GFLOPS on a GTX280. The highest-end Tesla has the equivalent of 4x GTX280’s, so you’re looking at about 1.2 TFLOPS there, minus some overhead for transfer.

So, perhaps a 100x speedup is feasible, but I don’t know about 250x, unless they were comparing it to numbers from an older processor (such as a Pentium IV).

Well, it all depends on what the configuration of the PC is. :-)

Possibly because Tesla has 240 processing cores running at close to 1.5 GHz. Since CUDA can hide a lot of latency, the combined computing capacity of these processors can be much higher than the core count of 240 alone suggests.

well I have water-cooled 280s which are oc’d (680 MHz base clock), so I believe I have about as fast gpus as they come.
I also believe the nbody benchmark in the sdk is written very carefully to optimize performance, and the only uncertain thing is how to define a FLOP in the face of instructions such as inverse square root, or multiply-and-add. the benchmark’s authors decided that the extra speed from multiply-and-add cancels the disadvantage of the expensive sqrt, so they say their algorithm, whose critical part has 20 floating point operations including all of the above, will be counted as 20 FLOPs. by that count my one gpu does about 470 GFLOP/s. but that’s a bit unfair (10% unfair? more? I guess more…) since the computation can be slowed down by graphics and other chores as well.

historically, people considered sqrt the equivalent of 10+ FLOPs, which I think is still a good way to quantify performance.
by that count my gpu does 650+ GFLOP/s, and my machine about 2 TFLOP/s.
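
To make the counting argument above concrete, here is a rough sketch in Python. It takes the 470 GFLOP/s and 20 FLOPs per interaction quoted above, derives the interaction rate, and rescales it under a different convention. The figure of 29 FLOPs per interaction is an assumption for illustration (the inverse square root counted as 10 instead of 1); the post does not say exactly which count was used.

```python
# Sketch: how the FLOP-counting convention changes the reported rate.
REPORTED_GFLOPS = 470.0   # single-GPU nbody figure quoted in the post
SDK_FLOPS = 20            # FLOPs per interaction as counted by the benchmark

def interactions_per_second(gflops, flops_per_interaction):
    """Convert a GFLOP/s figure back into pairwise interactions per second."""
    return gflops * 1e9 / flops_per_interaction

def rescaled_gflops(gflops, old_count, new_count):
    """Same hardware, same run time, different FLOP count per interaction."""
    return interactions_per_second(gflops, old_count) * new_count / 1e9

rate = interactions_per_second(REPORTED_GFLOPS, SDK_FLOPS)

# Assumption: rsqrt was 1 of the 20 FLOPs; counting it as 10 gives 20 - 1 + 10 = 29.
ALT_FLOPS = 29
alt = rescaled_gflops(REPORTED_GFLOPS, SDK_FLOPS, ALT_FLOPS)

print(f"{rate:.3g} interactions/s")
print(f"{alt:.1f} GFLOP/s under the alternative count")
```

Under that assumption the same run comes out at roughly 680 GFLOP/s, which is in line with the 650+ mentioned above. The interaction rate is the convention-free number to compare.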

now, what about the cpu? of course, IF you are able to compute efficiently on 4 cores of an intel processor, then you can probably get 20 or so GFLOP/s. I think that’s a fair number. nvidia must have compared a single-core cpu and a gtx, which isn’t quite fair, although in a particular setting, like most existing supercomputing clusters, it actually is fair because they don’t have quad-core cpu’s yet.
but a factor of 2000/20 = 100x over the cpu is not bad at all.

on the other hand, in order to do an nbody calculation, I would have to run not just 100 or 200 cpus, but a much larger number of nodes on a cluster, because nbody is a well-connected problem and bandwidth via the typical interconnect is horrible, so the calculation would be communication-bound.

conclusion: you can argue different ways and get different numbers, but one thing is sure: the cuda solution is fantastic for things you can port to cuda.
it can be claimed to be from 10 to many hundreds of times more powerful than other ways of computing.

Actually the performance gap is not even that large. The Intel MKL achieves ~16 GFlops for FFTs of length ~1024 on a single core. So it is closer to a factor of 20.
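
For what it's worth, the ratios behind that estimate can be checked in a few lines. This is only a sketch using the figures quoted in this thread (298 GFLOPS for the GTX280 FFT, about 16 GFLOPS for MKL on one core); the 4-GPU line assumes ideal scaling and ignores the transfer overhead mentioned in the first post.

```python
# Sketch: FFT speedup ratios from the numbers quoted in the thread.
gpu_fft_gflops = 298.0   # optimized FFT on one GTX280 (linked thread)
cpu_fft_gflops = 16.0    # Intel MKL, one core, FFT length ~1024
tesla_gpus = 4           # "equivalent of 4x GTX280" per the first post

single_gpu_ratio = gpu_fft_gflops / cpu_fft_gflops

# Assumption: perfect scaling across the 4 GPUs, no host-device transfer cost.
tesla_ratio = tesla_gpus * gpu_fft_gflops / cpu_fft_gflops

print(f"one GPU vs one core:   {single_gpu_ratio:.1f}x")
print(f"four GPUs vs one core: {tesla_ratio:.1f}x")
```

One GPU against one core gives a bit under 19x, and even the idealized 4-GPU case stays below 100x.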

Peak performance is just a rough indication. Actual application performance can be way off.

This problem has always been discussed for websites like top500.org. In their case they use Linpack as the measure of performance between systems. Yet is Roadrunner going to run only Linpack, and will it give 1 petaflops for real-life applications? In general, no.

Sadly without actually thinking about the way you are going to rewrite a code for the GPU or any parallel machine, you cannot pull out numbers and pretend they are going to be accurate…

A direct n^2 summation is unlikely to be communication bound even on a cluster with crappy interconnects.

Since we are talking about a direct n^2 summation, you could duplicate all particles on all processors. If you actually had so many particles that you couldn’t fit them all into the memory of one node, your calculation would take hopelessly long no matter how you did it. I would estimate you could fit 30 million particles into 2GB of RAM, and that’s almost 10^15 interactions. You would need a cluster the size of Blue Gene/L to tackle that in a reasonable amount of time. Or a cluster with 1000 gpus.
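
A back-of-the-envelope version of that estimate, as a Python sketch. The particle count and RAM size are from the paragraph above; the 20 FLOPs per interaction and 470 GFLOP/s per GPU are borrowed from the nbody figures quoted earlier in the thread and are assumptions here, as is perfect scaling over 1000 GPUs.

```python
# Sketch: memory per particle and time per force evaluation for direct n^2.
n = 30_000_000                 # particles, from the estimate above
ram_bytes = 2 * 1024**3        # 2 GB of RAM

bytes_per_particle = ram_bytes / n
interactions = n * n           # every particle with every particle

# Assumptions borrowed from the nbody numbers earlier in the thread.
flops_per_interaction = 20
gpu_gflops = 470.0

seconds_one_gpu = interactions * flops_per_interaction / (gpu_gflops * 1e9)
seconds_1000_gpus = seconds_one_gpu / 1000   # assumes perfect scaling

print(f"{bytes_per_particle:.1f} bytes available per particle")
print(f"{interactions:.1e} interactions per force evaluation")
print(f"{seconds_one_gpu / 3600:.1f} hours per evaluation on one GPU")
print(f"{seconds_1000_gpus:.1f} s per evaluation on 1000 GPUs")
```

So a single force evaluation is on the order of ten hours on one GPU and well under a minute on 1000 of them, per evaluation and before any time stepping.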

Even if you did need to break the particles up into N (where N is almost certainly <10) groups, the cost of shuffling them around would still be dwarfed by the compute cost.

Then all you would have to transfer would be a broadcast of N forces. If you wanted to get really fancy you could start broadcasting forces as each particle completed (or a small group of them) and almost completely eliminate communication overhead.
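
Here is a rough sketch of why the broadcast is cheap compared to the compute. Every number below is an assumption picked for illustration: 1 million particles, 100 nodes at the 20 GFLOP/s per quad-core CPU suggested earlier, a 100 MB/s link (roughly gigabit Ethernet), and 12 bytes per force (three 32-bit floats). The model ignores latency and network contention.

```python
# Sketch: compute time vs. communication time per step for direct n^2
# with all particles duplicated on every node. All inputs are assumptions.
n_particles = 1_000_000
n_nodes = 100
node_gflops = 20.0            # per-node CPU rate, as estimated earlier in the thread
flops_per_interaction = 20
link_bytes_per_s = 100e6      # hypothetical ~gigabit Ethernet
bytes_per_force = 12          # 3 components x 4-byte float

# Each node computes forces on its share of particles against all particles.
interactions_per_node = n_particles * n_particles // n_nodes
compute_s = interactions_per_node * flops_per_interaction / (node_gflops * 1e9)

# Each node must end up with all N forces, so it receives about N * 12 bytes.
comm_s = n_particles * bytes_per_force / link_bytes_per_s

ratio = compute_s / comm_s
print(f"compute: {compute_s:.2f} s, communication: {comm_s:.2f} s, ratio: {ratio:.0f}x")
```

Compute grows as N^2/P per node while the broadcast grows only as N, so the ratio gets better as the particle count goes up.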