300x to 600x faster... really?

If you ever take a look at the CUDA Zone website, you'll see things like

    • 300x faster - Furry Ball: GPU renderer for Maya

    • 600x faster - Parallel Algorithm for Solving Kepler's Equation

My initial thought is that speedups of this magnitude can only come from algorithms that are inherently memory bound… since the G200 can achieve a theoretical ~600 GFLOPS (MAD) and a quad core can reach >90 GFLOPS (single precision); rough arithmetic below.
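
Here's the back-of-the-envelope arithmetic I'm using for those peak numbers; the clock speeds and issue widths are my own assumptions (a GTX 280-class G200 at ~1.3 GHz shader clock, a ~3 GHz quad core doing a 4-wide SSE multiply and add each cycle), so treat it as a sketch, not vendor data:

    /* Back-of-the-envelope peak single-precision throughput (assumed clocks and widths). */
    #include <stdio.h>

    int main(void)
    {
        /* G200 (GTX 280-class): 240 SPs x ~1.296 GHz shader clock x 2 flops per MAD. */
        double gpu_gflops = 240 * 1.296 * 2.0;   /* ~622 GFLOPS, MAD issue only */

        /* Quad core: 4 cores x ~3 GHz x 8 SP flops/cycle (4-wide SSE mul + 4-wide add). */
        double cpu_gflops = 4 * 3.0 * 8.0;       /* ~96 GFLOPS */

        printf("GPU ~%.0f GFLOPS, CPU ~%.0f GFLOPS, ratio ~%.1fx\n",
               gpu_gflops, cpu_gflops, gpu_gflops / cpu_gflops);
        return 0;
    }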

…or maybe some comparisons were not entirely fair, such as

    • the CPU doing double precision while the GPU does single precision

    • comparing against a single-core CPU when you can get a quad core for a similar cost

    • comparing against an unoptimized CPU program while the GPU program has been carefully optimized

Any thoughts?

My own bias usually lies with the third point. You spend three weeks carefully thinking through everything about the GPU version, and then you slap together some CPU equivalent.

I usually compare with a single-core implementation and mention it in the discussion. People can divide by 4 on their own. If anything, that gives an advantage to the CPU version, since it is not clear that the scaling will be linear w.r.t. the number of cores being used.

Hi,

In my poster on CUDA Zone (entitled “CUDA Accelerated Sparse Field Level Set Segmentation of Large Medical Data Sets”), I report a 360x speedup over a CPU implementation and a 9x speedup over an OpenGL implementation of the same algorithm.

I used an existing 3rd party library (ITK) to benchmark the CPU algorithm. I didn’t write the code for this library, and I don’t know how optimized it is or even how many cores it utilizes.

My co-author wrote the OpenGL implementation and spent a long time optimizing it.

In my opinion, the 9x speedup over the OpenGL implementation is a much more interesting result than the 360x speedup over the ITK implementation. That being said, ITK is the de facto standard for volume segmentation and I believe comparing my CUDA implementation to ITK adds value to my poster.

Since my CUDA implementation has indeed been carefully optimized, my poster falls into your 3rd category to the extent that the ITK implementation is sub-optimal.

cheers,

mike

Thanks for the comments, guys!

@Ailleur: “slap together some CPU equivalent”, holy cow! I'm guessing that's not optimized at the assembly level. I appreciate your honesty. :) I struggled with that same problem too. Hmmm, suppose you were 100x faster: divide by 4 and it's only 25x, which looks much less impressive, doesn't it? (Granted, you're correct that in most cases it won't scale exactly linearly with the number of cores.)

@Mike: an author of a CUDA Zone poster, excellent; I'm very happy you've replied. 360x faster is very impressive! I'm not familiar with volume segmentation: is the ITK library known for its speed, or is it more of a reliability thing (or both)? I would agree with your assessment of the 9x speedup, since there you're comparing apples to apples.

Do you know how close your program came to the theoretical bandwidth/GFLOPS?

Hmmm… I'm beginning to rethink my earlier statement about compute bound vs. bandwidth bound; can these huge speedups just be a result of unoptimized CPU versions? (Please feel free to prove me wrong.)

Another reason for high speedups can be the type of instructions issued.
rsqrtf() is probably much, much faster on the GPU than on the CPU. Also, if you can use things like __sincosf in your program, you might get 'higher' speedups.
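
For illustration, a minimal sketch (the kernel and array names are made up, not anyone's actual code) of the kind of loop body where those intrinsics pay off:

    /* Hypothetical kernel built around fast GPU intrinsics. */
    __global__ void polar_terms(const float *x, const float *y,
                                float *inv_r, float *s, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float r2 = x[i] * x[i] + y[i] * y[i];
            inv_r[i] = rsqrtf(r2);            /* hardware reciprocal square root */
            float theta = atan2f(y[i], x[i]);
            __sincosf(theta, &s[i], &c[i]);   /* fast sine and cosine in one intrinsic */
        }
    }

On the CPU, the equivalent 1/sqrt, sin and cos all go through much slower library code.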

One of the sources of really impressive speedups, even against a well-optimized CPU version, is using the texture interpolation hardware. On the CPU, that’s a memory fetch plus some flops. On the GPU, that’s a memory fetch. Adds up if you do it a few billion times.
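
Roughly what that looks like on the GPU side, as a sketch using the texture reference API of that generation (the host code that binds a cudaArray and sets cudaFilterModeLinear is omitted, and the names are mine):

    /* 2D texture fetch with hardware bilinear filtering. */
    texture<float, 2, cudaReadModeElementType> srcTex;   /* bound to a cudaArray elsewhere,
                                                            with filterMode = cudaFilterModeLinear */

    __global__ void resample(float *out, int w, int h, float scale)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < w && y < h) {
            /* one fetch: the interpolation between the four neighbouring texels
               happens in the texture unit, not in the SPs */
            out[y * w + x] = tex2D(srcTex, x * scale + 0.5f, y * scale + 0.5f);
        }
    }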

Imagine you spend some weeks or months on a CUDA project and you want to publish it. Will you then spend another month setting up and optimizing CPU code that is of no use to you except for comparison?

Good point.

On the other hand, the hardware matters too. People might think it is unfair to compare the latest GPU to a two-year-old CPU.

But given a budget, not everyone has access to a cutting-edge CPU, whereas anyone can buy the latest GPU for a few hundred dollars.

What's worse is that you're encouraged to spend less time optimizing the CPU code, since doing so only serves to make your GPU results look less impressive.

At the very least, one could use readily available, optimized open-source linear algebra routines if you're creating the CPU version from scratch; see the sketch below.
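
For example (just a sketch using the standard CBLAS interface that ATLAS/GotoBLAS-style libraries expose; the wrapper function and sizes are made up), a CPU matrix multiply should at least be a call like this rather than a hand-rolled triple loop:

    /* Single-precision C = alpha*A*B + beta*C through an optimized CBLAS (e.g. ATLAS). */
    #include <cblas.h>

    void multiply(int n, const float *A, const float *B, float *C)
    {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,          /* M, N, K       */
                    1.0f, A, n,       /* alpha, A, lda */
                    B, n,             /* B, ldb        */
                    0.0f, C, n);      /* beta, C, ldc  */
    }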

You can get quad cores for less than a Tesla card; peak double precision on an Intel Core i7 965 XE is 70 GFLOPS whereas the Tesla is at 78 GFLOPS… sure, single precision is 5x-10x faster, but where does 300x to 600x come from? On top of that, on the CPU you don't have to worry about excessive latencies or designing general-purpose programs for a vector machine.

…I'm starting to feel like the GPU is being a little overblown.

I think this is a matter of perspective: sure, claims of 300x to 600x are ridiculous. But look at the example you just brought up:

  • Core i7 965 XE is $1000 for 70 double precision GFLOPS

  • GTX 285 (same chip as Tesla) is $350 for 78 GFLOPS

And if you consider single precision (which you should since the GPU is not really a good fit yet for heavy double precision calculations), then you are talking about a >5x improvement over a CPU for a very cheap device. Consider the expense, space and power requirements of four more computers vs. a single GTX 285. That’s fantastic unless you were expecting the GPU to be some kind of magical computing unicorn exempt from basic engineering constraints. :)

Deciding exactly how much faster CUDA is than your CPU is useful, but it can easily degenerate into pointless drag-racing. Once the novelty of GPGPU wears off, people will go back to comparing the speed of algorithms and implementations without focusing so much on CPU vs. GPU. What matters is performance vs. resources (cost, power, space, whatever), because ultimately the goal is to use your code to get something done, right?

Personally, for problems that show high data parallelism I find coding a fast CUDA implementation easy enough that I do that first, then decide whether it is worth my time to create an optimized CPU implementation. If I do work on a high(er)-performance CPU path, I rely on OpenMP and GCC auto-vectorization to do the heavy lifting. I don't have the time or expertise to screw around with SSE instructions and tinker with thread pools and optimal cache behavior. Lots of people complain that CUDA hardware is a black box, but you can get within 50% of optimal without much effort once you internalize a few simple rules. (That last 50% is what people get to write papers about. :) )
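
Concretely, my CPU paths tend to be no fancier than something like this (a sketch, not my actual code), compiled with gcc -O3 -fopenmp so the compiler does the vectorizing:

    /* Plain OpenMP across cores plus a loop simple enough for the auto-vectorizer. */
    void saxpy(long n, float a, const float *restrict x, float *restrict y)
    {
        #pragma omp parallel for
        for (long i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }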

As Thrust matures, this will get even easier, and if they can implement a good multicore+SSE CPU code path, I’ll just use that for everything. (I had been holding out for a compiler that would convert CUDA directly to multicore+SSE CPU code, but that seems to have been stalled somewhere in the transition from research project to practical tool.)

I'm glad you said it; I thought I was crazy.

It sort of looks like that's how it's being marketed… someone looking in from outside the GPU community sees 240 cores (or 800 cores on older ATI cards) and 300x faster this, 300x faster that (and it's the first thing you see on the CUDA website); what kind of conclusions do you think they will draw? (That's what happened with me…)

I agree that a 5x speedup is significant; my quarrel was with the 300x and 600x speedups.

I slightly disagree on pricing: you can get lower-end quad cores (I quoted the high-end one because I didn't have GFLOPS figures for the lower end) for prices similar to the GTX 285, even if we forget about the $1500 Tesla. The speedup is still a very good selling point.

However, I've seen quad cores reported at 200 GFLOPS whereas the GTX 285 is at ~640 (MAD)… from a theoretical standpoint, the 5x faster might be difficult to achieve.

Thanks for your reply, seibert.

On my CUDA application I found a speed increase of several orders of magnitude. My CPU version was written first, to the best of my ability. Rewriting in CUDA forced me to make fundamental changes that improved performance, so it's apples to oranges: my old code is way slow. But if I run my emulation code on my CPU vs. my GPU code, I only get a speedup of 10x. Of course I'm a newb, but still. Why don't people benchmark against their emulation code instead of making a new version?

Matt

Benchmark against device emulation? And you only have a 10x speedup?!

Well, all benchmarks are nonsense at some level, and I really don’t know how you deal with the problem of people comparing CUDA to suboptimal CPU implementations. It’s hard to decide what is “good enough” to be a fair comparison.

That also depends on how difficult those 200 GFLOPS are to achieve for your problem. :) It's quite easy to starve a CPU if your working set doesn't fit in the cache. The GTX 285 has as much bandwidth to its 1 GB of global memory as a Core i7 has to the L1 cache on one core. The GTX 285 leads CPUs in both FLOPS and memory bandwidth, and different problems are limited by one or the other.

In case tmurray's reaction isn't clear: device emulation is for debugging only, and is almost always the worst-performing way to run your algorithm on the CPU. In fact, I think a lot of misleadingly large GPU/CPU speedup measurements are made by comparing device emulation on the CPU to running on the GPU.

If you only see 10x difference between GPU and device emulation, I would start to worry that you are underutilizing the GPU.

This is a terrible understatement. Device emulation is unbelievably slow even if you just take into account the per-thread overhead.

See this link: http://gpgpu.univ-perp.fr/index.php/Image:…perf_03_900.png ; emulation is often 1000x-10000x slower than native execution on a GPU. I'll post some results here around Dec 15th that show maybe a better way of comparing apples to apples.

Step into my world for a minute (it is pretty boring, so you were warned).

I solve lots of linear equations using distributed memory solvers on clusters. Most of them live and die by BLAS3 performance. These codes are double precision and about as far away from embarrassingly parallel as you can get. Until Fermi comes along, double precision is something of the unwanted stepchild of GPGPU computing. Headline numbers are nothing like as impressive as single precision performance, and in comparison to embarrassingly parallel single precision codes that can really leverage the data-parallel nature of the GPU architecture, performance is "atrocious". Using CUDA and doing nothing other than switching over to a homebrew GPU BLAS3, I get a 2.5-5x application speed-up over optimized vendor BLAS. Is that a big deal? You bet it is! Simulations that used to take a week of cluster time now take less than two days. In my field that is like manna from heaven. It is safe to say that there isn't any other way I can get that kind of performance improvement in the computing environment I work in without writing large five-figure cheques.

So performance needs to be put into context. I am willing to accept that there are certain classes of compute bound applications which can achieve two orders of magnitude speed up over good serial CPU implementations. Even sitting at the absolute other end of the computing spectrum, I see enormous tangible benefits from this computing paradigm.

This is my favorite CUDA speedup claim:

http://www.amaxit.net/technicalsupport/images/Temple%20University,%20How%20to%20reach%2047000%20speed%20up%20on%20a%20GPU%20CUDA.pdf

They claim 47000x (yes, a forty-seven-thousand-times speedup!) by comparing single-threaded execution on the GPU to multi-threaded execution on the GPU.

You can get any speedup you want simply by picking the baseline for comparison. You could give the speedup compared to me using pencil and paper to do the same computation and get even better numbers. The problem is that it tells you nothing about absolute performance. I think everyone should be required to include whatever the relevant performance metric is for the application they're working on: how many pixels per second does your algorithm update, what's the number of cell updates per second, how many vertices do you process per second, etc. (see the timing sketch below). This would make it a lot easier to quickly weed out bogus speedup claims from the people doing good work and "only" getting 5-10x speedup when comparing a current-generation GPU using massively parallel multithreading to a current-generation CPU also using multithreading (albeit on a smaller scale).
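
For instance, here is a sketch of reporting an absolute rate with CUDA events (the kernel, grid size and data are placeholders, not anyone's actual benchmark):

    /* Report absolute throughput (cell updates per second) instead of only a speedup. */
    #include <cstdio>

    __global__ void update_cells(float *out, const float *in, int n)   /* placeholder kernel */
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 0.5f * (in[i] + out[i]);
    }

    int main()
    {
        const int n = 1 << 24;
        float *d_in, *d_out;
        cudaMalloc((void**)&d_in,  n * sizeof(float));
        cudaMalloc((void**)&d_out, n * sizeof(float));
        cudaMemset(d_in,  0, n * sizeof(float));
        cudaMemset(d_out, 0, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        update_cells<<<(n + 255) / 256, 256>>>(d_out, d_in, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%.3g cell updates per second\n", (double)n / (ms * 1e-3));
        return 0;
    }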

Definitely. That's the only real metric that matters. It continually bothers me to see wildly unfair comparisons being made in published papers. Comparing the speed of one core of a CPU to a quad-Tesla setup and then pointing out that you got a 100x speedup (at possibly 10x the cost) seems almost dishonest. I'd love to see some kind of attempt at a "standardized" description of speedups based on some performance-vs-resources metric. Of course, NVIDIA probably shouldn't be the ones to do this, due to obvious bias :) .

It's rather difficult to come up with something concrete even in simple terms of hardware cost. Comparing a $400 CPU to a $400 GPU isn't valid, since you can run the CPU code without a discrete GPU, but the opposite isn't true. Development effort is an even more abstract notion… it generally takes more effort (in my opinion) to write CUDA code than to write the CPU code. For my work right now, I figured the best way to describe the acceleration is just to honestly present the speedups against various metrics of power consumption and cost, and let the reviewers make their own decisions.