Measuring Application Speedup in CUDA Using Amdahl's Law - Clarification Needed

Hello All,

I implemented a portion of my code in CUDA and executed it on the GPU. The speed gain was observable, but I need to compute the overall speedup of the full application. For this I am relying on Amdahl's Law. Reference: http://en.wikipedia.org/wiki/Amdahl's_law

What is confusing me is that two versions of Amdahl's Law are given:

Overall Speedup = 1 / [(1-P) + P/S]
where P is the fraction of the code made parallel and S is the speed gain for that portion P,

and

Overall Speedup = 1 / [(1-P) + P/N] (for parallelization)

where P is the same but N is the number of processors.
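
To be concrete, here is a tiny sketch of how I understand the two versions (plain C; the P, S, and N values are made-up placeholders):

```c
#include <stdio.h>

/* Version 1: S is the speed gain of the parallelized portion P. */
double amdahl_v1(double P, double S) {
    return 1.0 / ((1.0 - P) + P / S);
}

/* Version 2: same formula with S replaced by N processors. */
double amdahl_v2(double P, int N) {
    return 1.0 / ((1.0 - P) + P / (double)N);
}

int main(void) {
    double P = 0.8;  /* placeholder: 80% of the runtime is parallelized */
    double S = 20.0; /* placeholder: that portion runs 20x faster      */
    int    N = 96;   /* placeholder: processor count                   */
    printf("version 1: %.2fx\n", amdahl_v1(P, S));
    printf("version 2: %.2fx\n", amdahl_v2(P, N));
    return 0;
}
```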

Q1. What is the difference between the two?

Q2. Which one should be used for the GPU case? Is my guess right that the first version is for uniprocessors and the second one for multiprocessors?

Comments and clarifications appreciated.

The first version predicts the actual speedup for a given parallel speedup factor S. The second version predicts the upper bound of the speedup for a given parallelization over N processors. As to which of the two you should use for the GPU, the answer is probably neither: Gustafson's Law is probably more applicable to the sort of data-parallel paradigm that CUDA operates in.
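
If it helps, here is a minimal sketch of Gustafson's scaled speedup (plain C; P = 0.95 is an illustrative value only):

```c
#include <stdio.h>

/* Gustafson's Law: scaled speedup when the problem size grows with N.
   P is the parallel fraction of the run on the parallel machine. */
double gustafson(double P, int N) {
    return (1.0 - P) + P * (double)N;
}

int main(void) {
    double P = 0.95; /* illustrative value only */
    for (int N = 1; N <= 128; N *= 2)
        printf("N = %3d -> scaled speedup %.1fx\n", N, gustafson(P, N));
    return 0;
}
```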

Disclaimer: I am a materials engineer, not a computer scientist, so take all of the above with a grain of salt.

The two formulas are basically the same. The first is the more useful form. The second just adds the assumption that the speedup is proportional to the number of processors, i.e., S = N: 3 processors means a 3x speedup. Of course that's never really the case.

But don't neglect the fact that there's often a constant term: overhead, usually from sending data to and from the GPU.
For example, say you have 1 GB of numbers to add together. The GPU can do it 10 times faster than the CPU, but just sending the values from the CPU to the GPU over the PCIe bus takes 5 times longer than the CPU compute, so you're killed by overhead no matter how fast your GPU is!
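
Putting rough numbers on that example (plain C; the timings are invented to match the scenario above):

```c
#include <stdio.h>

int main(void) {
    double t_cpu      = 1.0;          /* CPU compute time (normalized)          */
    double t_gpu      = t_cpu / 10.0; /* GPU does the math 10x faster           */
    double t_transfer = 5.0 * t_cpu;  /* but the PCIe transfer takes 5x the CPU time */

    /* Effective speedup once the constant transfer term is included. */
    printf("effective speedup: %.2fx\n", t_cpu / (t_gpu + t_transfer));
    return 0;
}
```

That works out to about 0.2x, i.e., a 5x slowdown overall despite the 10x-faster GPU math.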

Sometimes when dealing with potential speedups I use the simple mental formula 1/(1-P). This is Amdahl's law assuming an infinite speedup! But it gives a nice best-case baseline. Say your program spends two thirds of its time working on some fancy math, and your GPU version of that math is super-fast. Even assuming the GPU is infinitely fast, you're still limited, no matter what, to 1/(1-0.66) = 3x speedup even in the absolutely, positively ideal case. I like this "infinite" rule for quick estimation because you don't even need to consider the GPU coding itself to sometimes realize it may not be worth doing.
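
In code, the quick estimate is one line (plain C; the P values are just examples):

```c
#include <stdio.h>

int main(void) {
    /* Best case with an infinitely fast GPU: speedup = 1 / (1 - P). */
    double P[] = { 0.50, 0.66, 0.90, 0.99 };
    for (int i = 0; i < 4; ++i)
        printf("P = %.2f -> at most %.1fx\n", P[i], 1.0 / (1.0 - P[i]));
    return 0;
}
```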

The speed-up is defined as Ts/Tp.
If you have already ported the code, you can compute the speed-up you are getting.
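
For what it's worth, a minimal timing sketch; cpu_version and gpu_kernel here are hypothetical stand-ins for your own code, and the launch configuration is arbitrary:

```c
#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

void cpu_version(void) { /* ... your serial CPU code ... */ }
__global__ void gpu_kernel(void) { /* ... your CUDA kernel ... */ }

int main(void) {
    /* Ts: time of the serial CPU version. */
    clock_t c0 = clock();
    cpu_version();
    double Ts = (double)(clock() - c0) / CLOCKS_PER_SEC;

    /* Tp: GPU time via CUDA events; place the events around the
       memcpys too if you want transfers included in Tp. */
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    gpu_kernel<<<12, 128>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double Tp = ms / 1000.0;

    printf("speed-up Ts/Tp = %.2f\n", Ts / Tp);
    return 0;
}
```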

Amdahl's law can be used to find out the potential speed-up before you start the porting.
For example, if your code is spending 50% of its time doing I/O, it does not matter how fast the rest of the code runs on the GPU: you will be limited to an overall speed-up of 2x.

Wow, that was quick! Thank you, all three, for replying.

@SPWorley: I agree with you that data transfer operations reduce the full-application speedup, but I am focusing on the processing speed of the GPU. I get a speed gain for a portion of my application. I am deliberately excluding data transfers (I/O-bound operations) to keep the CPU vs. GPU comparison fair. My aim is to compare the processing times of the GPU and the CPU for the full application. The tutorials and papers I have come across stress processing speed or bandwidth as the performance metric. Full-application speedup depends on the nature of the application. My application is not 100% parallelizable, but the parallelized portions show a local speed gain. This prompted me to use Amdahl's law to check the maximum overall speedup against the observed results.

I am waiting for the day when researchers make the GPU an independent unit so that it no longer relies on the CPU for control.

@avidday: Thanks. It is clear :)

@mfatica: I may be wrong, but I thought the formula Ts/Tp was for the speedup due to pipelined execution on a uniprocessor. I agree with your view on the latter part.

To all:

I came across the paper "Amdahl's Law in the Multicore Era" by M.D. Hill and M.R. Marty (IEEE Computer, 2008). There they talk about symmetric and asymmetric multicore chips.
I have an NVIDIA GeForce 8800 GS, which has 12 multiprocessors with 8 cores per multiprocessor = 96 cores.
So I am assuming that my 8800 GS is a symmetric multiprocessor.

According to the paper, they have a different formula for speedup. Can you comment on it? Which one should I choose: the one given on the Wiki, i.e., Overall Speedup = 1 / [(1-P) + P/S], or the one given in the paper cited above?
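
For reference, here is my reading of the paper's symmetric-chip formula in code, which may not match the paper exactly (plain C; n is the total budget of base cores, r is the number of base cores fused into each bigger core, perf(r) = sqrt(r) is the paper's modeling assumption, and the values in main are illustrative only):

```c
#include <math.h>
#include <stdio.h>

/* My reading of Hill & Marty's symmetric-multicore speedup model. */
double hill_marty_symmetric(double f, int n, int r) {
    double perf = sqrt((double)r);  /* the paper's perf(r) = sqrt(r) assumption */
    return 1.0 / ((1.0 - f) / perf + (f * r) / (perf * (double)n));
}

int main(void) {
    /* illustrative: f = 0.9 parallel fraction, 96 simple cores (r = 1) */
    printf("speedup = %.2fx\n", hill_marty_symmetric(0.9, 96, 1));
    return 0;
}
```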

Thank you and help appreciated.

Ts/Tp is a very general formula for calculating speed-up and can be used for much more than just pipelines. It's a very common way to measure the speed-up of a GPU (Tp) versus some reference CPU (Ts), though for fairness the CPU implementation should also be parallelized as much as possible.