Intel paper: Debunking the 100X GPU vs. CPU myth

A somewhat controversial paper was presented at the ISCA conference this week:
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU, by Victor Lee et al. from Intel.

I think it may be an interesting read for the CUDA developer community… (and it has been a long time since we last had a speedup measurement methodology debate :) )

The authors compare the performance of several parallel kernels on a Core i7 960 against a GTX 280. Kernels are highly-tuned on both sides.
They measure very reasonable speed-ups, from 0.5x to 14x, and 2.5x on average.
The paper goes on to analyze the causes of suboptimal performance on both sides, and the implications for architecture design.

So here is the official PR answer from NV:
http://blogs.nvidia.com/ntersect/2010/06/g…says-intel.html

Unlike the blog poster, I would not question the fairness of Intel’s analysis. But he does have a point in claiming that the myth is that modern CPUs are easier to program than GPUs.
In this regard, it is interesting to note that Fermi’s improvements are mostly on the programmability side, and not that much on raw performance…

Any thoughts about this?

From a high performance perspective I agree with this paper. My only gripe was the fact that it compared theoretical scalar throughput between the CPU without SSE against the GPU with only one thread per warp active, both using multiple cores.

If you are willing to multi-thread your application, map it onto SIMD units, and carefully orchestrate memory traffic such that you hit the theoretical peak OP/s, then the speedup of your application will be commensurate with the difference in peak performance between the two architectures. The ratio of peak instructions/s, (~600 or ~900)/(~50 or ~100) = ~9-12, or of peak bandwidth in GB/s, (~140/~25) = ~5-6, should be the speedup of your application if you spend an infinite amount of time optimizing each implementation. It should be ~5-12 on that GPU compared to that CPU, not 100x.
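To make the arithmetic explicit, here is a minimal sketch in Python. The peak figures are the rough numbers quoted above; the way the low and high estimates are paired (600 with 50, 900 with 100) is my assumption, chosen because it reproduces the ~9-12 range.

```python
# Rough peak figures from the post above (GTX 280 vs. Core i7 960).
# ASSUMPTION: the low GPU estimate is paired with the low CPU estimate,
# and the high with the high, which reproduces the quoted ~9-12 range.
compute_pairs = [(600.0, 50.0), (900.0, 100.0)]  # (GPU, CPU) peak Gops/s
bandwidth_pair = (140.0, 25.0)                   # (GPU, CPU) peak GB/s


def peak_ratio(gpu_peak, cpu_peak):
    """Best-case speedup if both implementations reach their peak."""
    return gpu_peak / cpu_peak


compute_ratios = sorted(peak_ratio(g, c) for g, c in compute_pairs)
bandwidth_ratio = peak_ratio(*bandwidth_pair)

print("compute-bound speedup:", compute_ratios)     # between 9x and 12x
print("bandwidth-bound speedup:", bandwidth_ratio)  # about 5.6x
```

A fully tuned kernel should land somewhere between the bandwidth ratio and the compute ratios, depending on how memory-bound it is.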

A more important takeaway from this paper is that getting close to peak performance on both the CPU and the GPU required:

  1. Multi-threading to saturate all cores.
  2. SIMD to exploit multiple FUs per core.
  3. Tuning memory traffic to exploit the full bandwidth from all memory controllers.

For people on this forum, I would think that this paper makes CUDA more relevant as it addresses all of these issues.

EDIT: The claim of 100x speedup is really more of a comparison of a multi-threaded, SIMD implementation with regular memory accesses to a highly-tuned single-threaded, SISD implementation, with unstructured memory accesses, or perhaps an implementation in a language like CUDA versus an implementation in a language like C without SSE, without threads, and with an abundance of pointer chasing.

Section 5 of this paper (5.1: Platform Optimization Guide and 5.2 Hardware Recommendations) is a contribution to the discussion and I believe is something close to required reading for anyone who has the ambition to compare the performance of CPUs and GPUs:

However the rest of the paper is mostly Intel marketing:

Handpicking benchmarks neither proves nor debunks anything and is no more objective than the Nvidia marketing response.

Wake me up again when there is an efficient CUDA compiler running on its own target architecture.

"My only gripe was the fact that it compared theoretical scalar throughput between the CPU without SSE against the GPU with only one thread per warp active, both using multiple cores."

Is he crazy?

Actually, they seem to neglect the distinction between warp size (32) and “microarchitectural” SIMD width (8). So the quoted scalar throughput should be 4 times lower: there is just no way to issue more than 1 instruction per SM clock on GT200.

I believe scalar throughput is a meaningful metric for measuring how fast you can process scalar work such as address calculations and loop control.

Even in CUDA applications, I measured that around 30% of all instructions actually do scalar work and could be replaced by scalar instructions (although that ratio shrinks a bit in highly-tuned apps). The GPU suffers an overhead by performing address calculations and control inside SIMD units, and that should be taken into the balance for the comparison.

The 100x speedup might be relevant for a customer who has some legacy code they want to accelerate and must choose between:

  • spending x$ in faster CPUs, throwing more hardware at the problem (may not even work if the implementation does not scale),

  • spending y$ in development to get a 20x speedup on their current CPU hardware,

  • spending z1$ in development and z2$ in GPUs to get a 100x speedup.

At the end of the day, what matters to them is the development effort required to get a decent acceleration, not really the peak performance… Unless they can use some readily-available, highly-tuned libraries.

Well, at least they wrote 3 pages explaining why they believe their benchmarks are meaningful and representative and what their workload characteristics are, instead of just dismissing the other position by handwaving… ;)

Which application(s) do you think they should have included, which have characteristics not covered by their benchmarks?

That is the wrong approach. We need some warps active anyway, about 16, so we can add 16*32 threads for free. Because we NEED a minimum of 384 active threads to load the GPU, with 16 active threads it will work as if it had 384 anyway. That comparison is totally biased.

Reading that back, I can see how what I wrote could be misinterpreted. What I meant was that in one table in the paper, they compared theoretical performance using one thread per core and no SSE on the CPU against one active SIMD unit on the GPU and called it scalar throughput, even though threads in CUDA are implicitly mapped onto SIMD units.

To clarify, it was only one entry in one table, not the main .5x to 14x result.

The applications in publications where 100x or greater speed-ups are claimed. Nvidia has conveniently provided a list. Approach the authors and ask if they would be willing to let Intel optimize their CPU code. In fact, if the Intel employees who wrote this paper were serious about presenting “a fair comparison between performance on CPUs and GPUs”, they would do the same and allow Nvidia to optimize their code on GF100 Teslas. I think this would be a very positive and enlightening exercise.

"The applications in publications where 100x or greater speed-ups are claimed. Nvidia has conveniently provided a list. Approach the authors and ask if they would be willing to let Intel optimize their CPU code. In fact, if the Intel employees who wrote this paper were serious about presenting “a fair comparison between performance on CPUs and GPUs”, they would do the same and allow Nvidia to optimize their code on GF100 Teslas. I think this would be a very positive and enlightening exercise."

That would be good. But there are not enough Intel engineers to optimize every application. Sometimes it is harder to optimize for the CPU, sometimes for the GPU.

Agreed. My point (and this was with the marketing aspect of the Intel paper) was that they were not so much “debunking” 100x-plus speed-up claims, as ignoring them. Too bad the TV show “MythBusters” doesn’t cover HPC topics. :)

In my experience, that would be a positively glowing endorsement of a lot of scientific codes.

Well, they did cover GPU computing: :)

http://www.nvidia.com/object/nvision08_gpu_v_cpu.html

Great opening line on the NVIDIA response:

“It’s a rare day in the world of technology when a company you compete with stands up at an important conference and declares that your technology is only up to 14 times faster than theirs”

I’d say it’s probably worth pointing out that the i7-960 they were using costs what, double what the GTX 280 costs now? I do definitely agree though (and it’s been discussed here before) that a lot of the speed-up numbers quoted are pretty bogus. One that claimed “100x speedup” by using 4 Teslas with a combined cost far in excess of the CPU they were comparing to stands out in my mind. It’s pretty much impossible to give a real apples-to-apples speed comparison because of disparities in the effort needed to program, hardware cost, lack of portability, etc., so all we’re left with is real-world results of people getting their work done faster.

I personally prefer to say that for very little cost, my runtimes went from 8 hours to 20 minutes, instead of quoting “25x” or something of that manner.
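For what it's worth, the two ways of stating the result are the same number. A tiny sketch using the runtimes quoted above (the function name is just for illustration):

```python
def speedup(before_minutes, after_minutes):
    """Ratio of the old runtime to the new runtime."""
    return before_minutes / after_minutes


# 8 hours before, 20 minutes after, as quoted above.
s = speedup(8 * 60, 20)
print(f"{s:.0f}x")  # 24x, i.e. roughly the "25x" mentioned
```

The wall-clock version simply carries more information, since it also tells the reader what scale of job is being discussed.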

Or for me, quoting that I have upgraded a $300 PC, that is 3 years old, for less than $100 and have an application that is faster than on a $1000 PC?
(and actually, as the app runs on the GPU, the PC stays totally responsive while the app gobbles up the CUDA GPU resources)

Suffers? That’s OK with me. Separate units for address calculation working in parallel with SIMD (not in series!) may speed up our apps a bit (a fraction of that 30% of the time it takes to calculate the addresses) but they would take precious silicon space.

I prefer to have more SIMD units instead, thank you.

CPU is not dead, that’s for sure.

I think that the future lies in selecting the proper hardware for the job, and in the convergence of software and hardware. Like:

“Your algorithm requires 64 KB of constant cache? Here’s the chip for you. Can it do with 4 KB? I can offer you a higher-clocked chip for the same price”

And given the extent to which many CUDA applications end up memory bound, the extra instructions to compute addresses overlap with global memory transfers, assuming you have a decent occupancy.
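A toy model of that argument, as a minimal sketch: assume compute and memory transfers overlap perfectly, so kernel time is the larger of the two. All cycle counts below are hypothetical and chosen only for illustration; the 30% address share is the estimate quoted earlier in the thread.

```python
def kernel_time(compute_cycles, memory_cycles, overlap=True):
    """Two-resource model: with perfect overlap the slower resource
    determines the time; without overlap the two costs add up."""
    if overlap:
        return max(compute_cycles, memory_cycles)
    return compute_cycles + memory_cycles


# HYPOTHETICAL costs in cycles for a memory-bound kernel.
useful = 700    # "real" SIMD work
address = 300   # address/control work, ~30% of all instructions
memory = 1500   # global memory transfers

with_addr = kernel_time(useful + address, memory)
without_addr = kernel_time(useful, memory)
print(with_addr, without_addr)  # same time: the address work is hidden
```

Under these assumptions, removing the address instructions changes nothing until the kernel becomes compute-bound, which is also why the same model can be used to argue that faster compute is pointless for such kernels.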

Maybe. But then power consumption might be a problem. At least we may want to avoid burning power by having 8 neighbor units perform the exact same computation at the same time…

Also, I would be happy to have more vector registers to store actual vector data instead of replicated scalar values, and more L1 cache for my data…

(Don’t worry, this is just my own pet theory ;))

Yes, but that could also justify that there is no need to compute fast since you end up memory-bound anyway. :)

In fact, I agree that the Intel CPU suffers from scalar overhead too, because of its shorter SIMD width.

For instance, to compute the addresses necessary for loading a vector of 32 consecutive values, the GPU needs one 32-wide instruction (4 cycles), while the CPU will need 8 scalar instructions (one for each 4-wide vector).

Some Larrabee-like wide SIMD might need only 2 scalar instructions.

So the “scalar throughput” is not so different… Except the CPU can run scalar instructions completely in parallel with SIMD instructions…
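The counts above can be reproduced with a short sketch. It assumes one address-computing instruction per SIMD-width chunk of consecutive elements, and takes 16 lanes as the width of a "Larrabee-like" unit; both are my assumptions for illustration, not figures from the paper.

```python
import math


def address_instructions(n_elements, simd_width):
    """Address computations needed to load n consecutive elements,
    assuming one per SIMD-width chunk."""
    return math.ceil(n_elements / simd_width)


def issue_cycles(warp_size, simd_lanes):
    """Cycles to issue one warp-wide instruction on simd_lanes lanes."""
    return math.ceil(warp_size / simd_lanes)


n = 32
print("CPU, 4-wide SSE:", address_instructions(n, 4))           # 8 instructions
print("16-wide SIMD (assumed):", address_instructions(n, 16))   # 2 instructions
print("GPU, 32-wide warp:", address_instructions(n, 32))        # 1 instruction
print("GT200 cycles per warp instruction:", issue_cycles(32, 8))  # 4 cycles
```

So the GPU issues one instruction that occupies its lanes for 4 cycles, while the CPU issues 8 scalar instructions that can run alongside its SIMD work.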

Sylvain,
Thanks once again for bringing up this topic. After a fair amount of GPU programming (the initial excitement, hype, 100x and then settling to reality), I think it is important to set things in perspective. Finally, Intel has done it.

But that said, it is impossible to reach theoretical performance on CPUs without spending a humongous effort.

For example,
I reached 56 GFlops DP performance on a Tesla C1060 after sitting on the program for 1 day… I am sure this will scale well on Fermi as well.

To reach 3.4 GFlops DP on the CPU using SSE, I spent more than a week… involving a lot of prototyping… Taking care of alignment (well, Nehalem has finally fixed it…), cache, register pressure, breaking dependence chains,… Aaah…
It’s almost impossible to know what the heck is happening inside the CPU unless you have a tool like VTune. (Even VTune only gives an approximation and it is not foolproof.)

And all the tools from Intel are “commercial”. And everything from NVIDIA is “free”. We should also look at developer friendliness.

If Intel is really committed to its developers, it should consider releasing some developer-friendly tools free of cost.