This is a really general question, and perhaps it does not fit in this forum, but I didn't know where else to ask for this kind of information.
I was working on a GPU implementation of an algorithm. Things work, the algorithm's calculation time was reduced, and everything is fine! Nevertheless, now I have to write a report about it, and I tried to find a recent comparison between CPUs and GPUs.
I found a chart that compares CPU development with GPU development from 2003 to 2006, but that is two years old, and immense development has taken place in the meanwhile. I mean, isn't there anything out there that compares the new Tesla 10-series or the 200-series with the new Intel processors?
I had a look at recent articles published on nvidia.com and searched for it on Google, but I couldn't find anything; most articles just focus on the teraflop limit being passed, but don't put it in comparison.
So perhaps some of you have such a chart or know where to find it!
Perhaps you should also note that this comparison is good for marketing, but it's totally irrelevant otherwise. Instead of writing a report where you compare those, perhaps you should just discuss how pointless it is :)
@spg - thanks, that is exactly what I was looking for; it didn't occur to me to search in the programming guide, even though I had been working with that document for weeks. :)
I know that this graph doesn't show the real computing difference for programs, and since it is made by NVIDIA, it is pure marketing. However, this is about the introduction to the issue, and for that it is a good start. The aim of the project is not just speed but also quality, implementation effort, and whether GPUs are already ready for general-purpose programming… so the report discusses whether all this marketing hype can hold what it promises…
Of course. But in any published paper, even in scientific journals, the first and last pages must be marketing. (a) Most people read the intro and skip to the conclusions, and (b) if they don't see marketing there, they ignore your paper as pointless. It's just the way the culture works, and those few sensible people out there (like us) can't change the entire culture.
Anyway… the update to the GFLOP/s graph in the new 2.0 guide is nice. And I like how they added a memory bandwidth graph, too (since that is the only number I care about). It is funny, though: their memory bandwidth graph only goes up to the G80 Ultra… I guess they didn't want to show the bandwidth drop in G92 :) Just another example of culture and marketing. But they could have included the bandwidth for G200.
At least for the GPUs, you can get any of these numbers for yourself to make a prettier plot just by browsing through the Specifications tables at www.nvidia.com. I’m not sure where to find such nicely organized data for CPUs, though. It always seems like a PITA to find theoretical GFLOP/s numbers for CPUs.
Isn't theoretical FLOP/s for CPUs just (16 bytes / sizeof(the datatype you care about, float or double)) × (number of cores) × (frequency)? (The first factor accounts for SSE.)
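For what it's worth, a minimal sketch of that back-of-the-envelope estimate (the core count and clock below are made-up example values, not any specific CPU):

```cuda
// Rough peak-FLOP/s estimate from the formula above:
// (SSE register width / element size) x cores x clock.
// Core count and clock are hypothetical example values.
#include <cstdio>

int main()
{
    const double sse_width_bytes = 16.0;  // 128-bit SSE registers
    const double cores           = 4.0;   // example: quad-core CPU
    const double clock_ghz       = 3.0;   // example clock speed

    double sp_gflops = (sse_width_bytes / sizeof(float))  * cores * clock_ghz;
    double dp_gflops = (sse_width_bytes / sizeof(double)) * cores * clock_ghz;

    printf("Single precision peak: %.1f GFLOP/s\n", sp_gflops);  // 48.0
    printf("Double precision peak: %.1f GFLOP/s\n", dp_gflops);  // 24.0
    return 0;
}
```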
You need not advertise theoretical performance that has absolutely no meaning, i.e. don't feed the troll with your publications :). This reminds me of some performance figures on the Cell, where people could not really agree on a theoretical bus bandwidth, while it was pretty clear that there existed a limit that was typically hit by realistic applications. Sustained GFLOP/s may thus make a little more sense, if taken with great care…
Just as processors are often compared using BLAS or similar kernels, perhaps it is more interesting to reference the best implementation of such kernels (typically GEMM) to give orders of magnitude… such figures are pretty common in the literature. Once again, they have very little meaning, but I personally think they are less bad than "NVIDIA promised me I would get 10 TFLOPS".
Hopefully (or not), I'm convinced we will sooner or later have some LINPACK, or one of those SPEC* test suites… has anyone heard of such a thing yet?
I definitely agree that finding a proper, more or less reliable, theoretical performance analysis is a real pain, even on a CPU, where we know almost everything about the underlying implementation, and even worse on GPUs, where we know much less.
I suppose NVIDIA's figures also result from the assumption that all ALUs are used on every cycle, with no memory stalls and so on?
Anyway, I'm just being picky, but since everyone here seems to treat those numbers with caution, there is no point in insisting on it anymore ;)
Trust me, I tried. It was impossible to convince the other authors of the paper (who were not programming/hardware experts) not to include it. It was insisted that we needed something in the abstract/introduction that the layperson could understand as an explanation of why the heck we were even considering going through all this effort. And someone who has heard anything at all about HPC has certainly heard about the Top 500 and the race for more GFLOP/s, even if they have no real understanding of what it means.
Yeah. There is a benchmark floating around the forums somewhere that gets very close to this peak (for compute 1.x hardware), so the hardware is actually capable of achieving it. Of course, doing so requires thousands of MAD operations one after the other in each thread. I haven't seen this benchmark updated for the G200 chips; it would need to be modified to do a MAD and a MUL every tick.
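For reference, a minimal sketch of the kind of kernel such a peak benchmark relies on (this is not the actual forum benchmark; the kernel name and iteration count are illustrative):

```cuda
// Illustrative peak-MAD kernel: a long chain of dependent multiply-adds,
// so each instruction is a MAD (2 flops) and the loop is compute bound.
__global__ void mad_chain(float *out, float a, float b)
{
    float x = threadIdx.x * 0.001f;
    #pragma unroll 64
    for (int i = 0; i < 4096; ++i)
        x = x * a + b;                  // one MAD per iteration = 2 flops
    // Write the result so the compiler cannot optimize the chain away.
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}
```

GFLOP/s would then be roughly (total threads × 4096 × 2) divided by the elapsed time; as noted above, on G200 one would also have to interleave an extra MUL to approach the dual-issue peak.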
There is something to be said for the theoretical memory bandwidth on these GPUs, though. With coalesced accesses in CUDA, it is relatively easy to attain 80% of the theoretical peak across a wide variety of algorithms.
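To make that concrete, here is a minimal sketch of a coalesced copy kernel one could time to estimate achievable bandwidth (names and the timing formula are illustrative, not a specific benchmark):

```cuda
// Coalesced copy: thread i reads and writes element i, so each warp's
// accesses fall into contiguous, aligned memory segments.
__global__ void copy_coalesced(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];
}

// Host side: effective bandwidth in GB/s is roughly
//   2 * n * sizeof(float) / elapsed_seconds / 1e9
// (the factor of 2 counts one read plus one write per element).
```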
I think you become critical of published figures quite fast when you work in this field. Of course I searched for publications that did similar implementations and I found quite a few, but you really have to read between the lines to see how the gain factors are achieved. One common trick is to compare a double-precision CPU implementation with a single-precision GPU one; in my implementation this can slow the CPU version down by up to 30%.
Slide 5 of the Siggraph CUDA session (http://developer.nvidia.com/object/siggraph-2008-CUDA.html) has the GT200 (GTX 280) bandwidth included. The FSB curve stops in 2007 since the FSB has stayed at 1600 MHz since then (and QuickPath wasn't available at the time).
An elaboration on the bandwidth figure: both the FSB and GPU bandwidths are theoretical peaks (bus width × clock).
Also, the FSB bandwidth is read-only bandwidth. Sometimes you'll see quotes for the FSB that add the read and write bandwidths together, coming up with a larger number, but the vast majority of apps are read-bandwidth limited. Also, FSB write bandwidth is slower than read bandwidth (I want to say 75%, but I'm not positive about that), at least prior to the latest version.
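As a worked example of the bus-width-times-clock figure (assuming GTX 280-class numbers: a 512-bit interface and roughly 1107 MHz GDDR3 with two transfers per clock):

```cuda
// Theoretical peak bandwidth = bus width x memory clock x transfers per clock.
#include <cstdio>

int main()
{
    double bus_bytes = 512.0 / 8.0;  // 512-bit memory interface, in bytes
    double mem_clock = 1107e6;       // memory clock in Hz (GTX 280-class)
    double per_clock = 2.0;          // double data rate: 2 transfers per clock

    double peak_gb_s = bus_bytes * mem_clock * per_clock / 1e9;
    printf("Theoretical peak: %.1f GB/s\n", peak_gb_s);  // ~141.7 GB/s
    return 0;
}
```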