How to compare performance by power and area profiles of a multicore device ...

Hi,

Is it a good idea to multiply power (in watts) by die area (in mm2) to get a performance estimate for a CPU or GPU?

I did the following calculation:

Consider a GPU and CPU both in 28 nm technology. Then I calculate the performance such that,

Performance = sqrt(power (in W) x area (in mm2))

Now,

1) Nvidia GK110 (GTX 780 Ti) 28 nm GPU: 245 W and 551 mm2
2) AMD Kaveri 28 nm CPU: 95 W and 245 mm2

The “Performance = sqrt(power x area)” ratio would be,

sqrt(245 × 551) / sqrt(95 × 245) ≈ 2.4
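In Python, the same calculation (using the figures above) looks like:

```python
import math

# Figures quoted above: (TDP in W, die area in mm2)
gpu_power, gpu_area = 245.0, 551.0   # GTX 780 Ti
cpu_power, cpu_area = 95.0, 245.0    # AMD Kaveri

gpu_perf = math.sqrt(gpu_power * gpu_area)  # "performance" per the formula above
cpu_perf = math.sqrt(cpu_power * cpu_area)

print(round(gpu_perf / cpu_perf, 1))  # → 2.4
```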

From the above calculation, I get only a 2.4× performance improvement! What is wrong with the above calculation? Please explain …

what happens when you change your calculation to watts/flop and/or area/flop?

put differently, normalize the input power and area according to the flops achieved per watt and per unit area
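a rough sketch of that normalization in python - note the peak-GFLOPS numbers here are made-up placeholders, to be replaced with each device's stated theoretical throughput:

```python
# normalize power and area by peak throughput; placeholder numbers only
def efficiency(power_w, area_mm2, peak_gflops):
    """Return flops-normalized efficiency metrics for one device."""
    return {
        "gflops_per_watt": peak_gflops / power_w,   # throughput per watt
        "gflops_per_mm2": peak_gflops / area_mm2,   # throughput per mm2 of die
    }

# placeholder peak_gflops values - substitute real spec-sheet figures
gpu = efficiency(245.0, 551.0, peak_gflops=5000.0)
cpu = efficiency(95.0, 245.0, peak_gflops=100.0)

print(gpu)
print(cpu)
```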

Performance = sqrt(power (in W) x area (in mm2))

This strikes me as a rather arbitrary definition, and the resulting units for “performance” don’t seem to be in any meaningful way related to computational throughput. What is the derivation of the above formula?

CPUs and GPUs skew the use of silicon real estate to different ends: CPUs spend a high percentage of the die area on storage devices of any kind (mostly caches, but also register files and buffers of various kinds), whereas GPUs devote relatively little area to storage and much more to execution units. Combined with other differences this lets GPUs perform well on some use cases where CPUs perform poorly, and vice versa.

As a “gedankenexperiment”, one could envision future parallel architectures where the GPU is the main processor doing the majority of the work and the role of the CPU is to act as an attached “serial code accelerator” :-)

rather than these theoretical metrics, why not just try a bevy of algorithms and compare a high-end CPU to a GTX 780 Ti?

For example try:

sorting 100 million 32-bit floats using thrust::sort() with device pointers

Jimmy Petterson’s reduction test (search the forum for that)

100 4000x4000 Sgemm() calls (averaged)

One of my older brute force tests such as this one (you can copy from raw, compile as compute 3.5 and set the use_fast_math flag, and run from console)

https://github.com/OlegKonings/CUDA_brute_triangle/blob/master/EXP3/EXP3/EXP3.cu

I see over a 300x performance advantage for the GTX 780 Ti over an overclocked Intel i7 on that application.

IMO Nvidia is very conservative when they talk about the performance difference between CPUs and GPUs. They tend to use linear algebra subroutines, FFTs, and Monte Carlo simulations for their comparisons, and when they do, they make sure the CPU implementation is optimal and runs on a high-end product.

If Nvidia used problems such as brute force or image processing, the differential would be much higher. No Xeon Phi is going to be able to compete head-to-head with the dollar-equivalent consumer GPUs on those problems, but go ahead and prove me wrong if you think they can.

If you have tests that suggest otherwise, please post them with the source code (like I just did). If I am incorrect, and these ‘back of the envelope’ calculations truly indicate that the GTX 780 Ti is only about 2 times faster than an AMD CPU, then I will truly be surprised.

I am not sure a Kaveri CPU is the best comparison point in the first place, because, as best I know, it also includes a GPU on the chip. A Haswell Xeon without a GPU would probably make a better representative of a “pure” CPU.

As CudaaduC points out, performance based on paper specifications might not translate well to real world application performance, and that applies to memory-bound problems as well as compute-bound ones. IMHO, the most meaningful comparisons use application-level performance of comparatively optimized code. For example, comparing CUBLAS performance on K40 or K80 to MKL BLAS performance on a Haswell-based server.

What I thought is that in multi-core architectures, if a single core’s performance can be defined by sqrt(power x area), then multi-core performance can be derived from the same expression.

For example, if one core has performance sqrt(power x area), then 4 cores give sqrt(4·power x 4·area) = sqrt(16·power·area) = 4·sqrt(power x area), i.e. 4 times the single-core performance. Simple as that :)
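A quick numeric check of that scaling - the square root is what makes quadrupling both power and area come out to exactly 4× the single-core figure:

```python
import math

p, a = 95.0, 245.0  # any positive power (W) and area (mm2) will do

one_core = math.sqrt(p * a)
four_cores = math.sqrt(4 * p * 4 * a)  # quadruple both power and area

# sqrt(4p x 4a) = sqrt(16 * p * a) = 4 * sqrt(p * a)
print(four_cores / one_core)
```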

Power is important because Dennard scaling no longer holds. Area follows from Moore’s law and Pollack’s rule.

As stated I accept:
“CPUs and GPUs skew the use of silicon real estate to different ends: CPUs spend a high percentage of the die area on storage devices of any kind (mostly caches, but also register files and buffers of various kinds), whereas GPUs devote relatively little area to storage and much more to execution units. Combined with other differences this lets GPUs perform well on some use cases where CPUs perform poorly, and vice versa.”

++++++++++++++++++++++++++++++++++++++

As suggested by little_jimmy, I would like to normalize the power and area with flops. How do I do that? If power and area are divided by flops, then what would the performance measure be?

I have seen the following link which used to compare the efficiency of CPUs and GPUs.

http://www.realworldtech.com/compute-efficiency-2012/

I need this kind of method to compare the performance. How to calculate it?

You are correct. It is a CPU+GPU; I just noticed. Haswell Xeons are on a different process node (22 nm), aren’t they?
Intel doesn’t have 28 nm devices; that’s why I selected an AMD CPU. Do you know any good 28 nm CPU-only device?

“For example if one core has performance equal to (power x area), then 4 cores performance would be sqrt(4.power x 4.area) = 4.(power x area ). Simple as that :)”

perhaps it is not “Simple as that :)”

seemingly, for this to hold, a gpu must have ‘cores’, and those ‘cores’ must be comparable to those of the other device as part of the comparison; otherwise you cannot directly compare power × area

among other things, the mentioned comparison based on power × area assumes that clock frequencies are comparable, such that the components of the power draw are comparable - power converted into flops versus power simply dissipated as heat
you probably need to discount the amount of input power each device simply converts into heat from the amount converted into flops, to properly proxy the underlying architectural dichotomy

you can perhaps normalize area by using the stated theoretical sp/dp flops to calculate flops/area