Amdahl’s Law for GPUs: is Amdahl’s law applicable to GPUs too?

Hi all,
I’m using a GTX 280 to run an algorithm which I also run on a CPU. I wanted to compare the speedup of the GPU over the CPU, so I divided the CPU computation time by the GPU computation time. I wanted to see what Amdahl’s law (http://en.wikipedia.org/wiki/Amdahls_law) predicts for the maximum speedup on this GPU. However, I’m confused about how to apply Amdahl’s law to a GPU. What number should I use for the number of processors: 30, 240, or something else?!

Moreover, as the size of the system increases, I gain more speedup from the GPU compared with the CPU. But according to Amdahl’s law this number should be fixed (because the number of processors and the fraction of the algorithm that can run in parallel are both fixed numbers). Any explanation?!?

:blink:

Any help or advice is appreciated.

Looking at the wiki page, it just looks like a fancy way of saying the obvious: if you speed up 80% of your code 1000x, you’ll still have the other 20% left over, so the total speedup can’t be more than 5x. This is actually the Achilles’ heel of all those “80x” performance claims regarding CUDA: they focus on the inner loop and don’t reveal that all the other overheads and operations reduce the effective speedup quite a bit.
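
To put numbers on that, here’s a minimal sketch (plain C++; the function name and constants are just for illustration):

```cpp
#include <cstdio>

// Amdahl's law: overall speedup when a fraction 'parallel' of the runtime
// is accelerated by a factor 'boost' and the rest is left untouched.
double overallSpeedup(double parallel, double boost) {
    return 1.0 / ((1.0 - parallel) + parallel / boost);
}

int main() {
    // The example above: 80% of the code sped up 1000x gives ~4.98x overall.
    printf("80%% sped up 1000x: %.2fx\n", overallSpeedup(0.8, 1000.0));
    // Even an enormous speedup of that 80% can never beat 1/0.2 = 5x.
    printf("80%% sped up 1e12x: %.2fx\n", overallSpeedup(0.8, 1e12));
    return 0;
}
```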

But I’m not sure what you mean by trying to “apply” it.

Let me explain it with numbers:
Amdahl’s law says that if a fraction f of the program runs sequentially and the remaining (1-f) can run on P processors, then the maximum speedup is less than 1/[f + (1-f)/P].
In my case I found that f = 1%, and I’m using a GTX 280, which has 30 multiprocessors, i.e. P = 30. So the maximum speedup for my code on the GPU should be less than 1/[0.01 + 0.99/30] ≈ 23.26. This is a fixed number, independent of the size of the data. However, when the size of the data grows dramatically, I measure a much bigger speedup when comparing GPU time with CPU time.
So I’m confused: does Amdahl’s law hold for GPUs or not?!
I’ve read that GPU performance depends on the data size, but this dependence isn’t mentioned in Amdahl’s law!
Am I right or totally lost?!!! :blink:

If you consider the size-of-the-data argument, you should look at Gustafson’s law instead of Amdahl’s.
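
For reference, here are the two laws side by side in a minimal sketch (variable names are mine; f = 0.01 is taken from your post, and 30 vs. 240 are the two processor counts you mentioned):

```cpp
#include <cstdio>

// Amdahl assumes a fixed problem size; Gustafson assumes the problem
// grows with the machine, so the serial fraction matters less and less.
double amdahl(double f, double N)    { return 1.0 / (f + (1.0 - f) / N); }
double gustafson(double f, double N) { return N - f * (N - 1.0); }

int main() {
    const double f = 0.01;               // 1% serial fraction
    const double Ns[] = {30.0, 240.0};   // MPs vs. SPs on a GTX 280
    for (double N : Ns)
        printf("P=%3.0f  Amdahl bound=%6.2fx  Gustafson scaled=%6.2fx\n",
               N, amdahl(f, N), gustafson(f, N));
    return 0;
}
```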

Now Amdahl’s law must perhaps be adapted somehow: if you consider that a GPU runs k times faster than a CPU (that’s a strong simplification, of course), you can certainly try to apply Amdahl’s law by treating 1 GPU as k CPUs, provided your sequential section is on the host.

If your sequential code is on the GPU, you could apply the law by counting a CPU as 1/k of a computing unit…

This is all pretty messy, but if you understand Amdahl’s law, you should be able to adapt it so that you get your speedup bound (or some pointless number which you then have to interpret, of course).

You know what the real problem is for me? I don’t know which number I should use in Amdahl’s or Gustafson’s law as the number of parallel processors!! Does this number depend on the size of the input data? E.g., when the matrices are 100x100 versus 1000x1000, does the GPU use a different number of processors in these two cases??

I’m thinking that for a 100x100 matrix I need 10,000 threads, which is 313 warps, and since a GTX 280 multiprocessor can hold 32 resident warps, that’s 313/32 ≈ 10 multiprocessors; therefore in Amdahl’s law I should use 10 as the number of parallel processors. A 1000x1000 matrix needs 31,250 warps, i.e. roughly 976 multiprocessors, but because the GTX 280 only has 30, in Amdahl’s law I should use 30 as the number of processors… am I right?!
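
Here is a sketch of that arithmetic (assuming one thread per matrix element and the GTX 280 limits: 32 threads per warp, 32 resident warps per SM, 30 SMs):

```cpp
#include <cstdio>

// Back-of-the-envelope occupancy arithmetic for a GTX 280,
// assuming one thread per matrix element.
const int WARP_SIZE    = 32;  // threads per warp
const int WARPS_PER_SM = 32;  // resident warps per multiprocessor (1024 threads)
const int NUM_SMS      = 30;  // multiprocessors on a GTX 280

void estimate(int n) {
    long long threads   = (long long)n * n;
    long long warps     = (threads + WARP_SIZE - 1) / WARP_SIZE;        // round up
    long long smsNeeded = (warps + WARPS_PER_SM - 1) / WARPS_PER_SM;
    long long smsUsed   = smsNeeded < NUM_SMS ? smsNeeded : NUM_SMS;    // capped at 30
    printf("%4dx%-4d: %8lld threads, %6lld warps, needs %4lld SMs, uses %2lld\n",
           n, n, threads, warps, smsNeeded, smsUsed);
}

int main() {
    estimate(100);   // 10,000 threads -> 313 warps -> ~10 SMs
    estimate(1000);  // 1,000,000 threads -> 31,250 warps -> capped at 30 SMs
    return 0;
}
```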

Has anybody checked his/her achieved GPU speedup against these laws?

There’s no good way to estimate the number of processors because there’s no real equivalence between GPU cores (be it MPs or SPs) and CPU cores. Apples to oranges.

The situation you are observing is a changing number of processors/threads being used efficiently to process the data. Because of the architecture, each of the 240 SIMD processors requires multiple threads to hide memory latency, so as your problem gets larger you are using more threads to process your data. As a minimum you need 4 threads per processor just to cover the instruction execution cycle: each instruction takes a minimum of 4 cycles, but another one can start in each cycle. Thus, for 8 processors per multiprocessor and 4 threads per processor, you need 32 threads per multiprocessor, and so on, plus the threads needed to hide memory latency, which will probably require 3 to 5 times more. For Amdahl’s law, use the fixed number of processors: 8 times the number of multiprocessors, i.e. 240. The change in performance is due to how efficiently work is assigned to the processors; Amdahl’s law states the best case, not accounting for processor starvation etc. You will also see various knees in the performance improvement depending on how well the problem size maps to the architecture.
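
A sketch of that minimum-thread arithmetic (C++; the constants are the GTX 280 figures quoted above):

```cpp
#include <cstdio>

// Minimum threads needed to keep a GTX 280 busy, per the reasoning above.
int main() {
    const int sms          = 30;  // multiprocessors
    const int spsPerSm     = 8;   // scalar processors per multiprocessor
    const int threadsPerSp = 4;   // to cover the 4-cycle instruction pipeline

    int minThreads = sms * spsPerSm * threadsPerSp;   // 960 total, 32 per SM
    printf("min threads to cover execution: %d (%d per SM)\n",
           minThreads, spsPerSm * threadsPerSp);
    // Hiding memory latency probably needs 3-5x more than that:
    printf("with latency hiding: %d to %d threads\n",
           3 * minThreads, 5 * minThreads);
    return 0;
}
```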

You are right, but what number of processors should we substitute into Amdahl’s law for a GPU? (In my case, a GTX 280.)

What you’re trying to do makes no sense. The equation portion of Amdahl’s law is far too simple an approximation to be used like this.

So how can I predict the maximum achievable speedup on a GPU?

Please let me know if you have any suggestions.

You can’t. There are a lot more factors than the number of processors when it comes to predicting GPU speedup.

The short answer is that there’s no easy and dependable method. Amdahl’s law is correct in stating that if your program has sequential parts which cannot be parallelised, you will only get a certain maximum speed-up: if 50% of the time spent in your program has to stay sequential on the CPU, you’ll never get more than a 2x speed-up. Sadly, you won’t get much more out of Amdahl’s law. It was originally devised, I think, to estimate the speed-up of a “cluster” of given CPUs against a single CPU of the same type, under idealized assumptions: no communication/synchronization lag, no copying of data between nodes, perfect load balancing, and no time lost to parallelizations of the algorithm that are less than 100% efficient. And we haven’t even started to consider how the GPU works: stream processing, shared memory, coalescing (achieving which, for example, sometimes requires additional sorting), etc.

The only real option for estimating speed-up I can think of is, perhaps, to benchmark the memory throughput of the to-be-parallelized part of the CPU code (how many gigabytes of data it chews through per second) and then compare it to the memory bandwidth of the GPU (or maybe ~75% of it). This assumes your implementation is really good, you don’t experience diminishing returns with scaling, and CPU->GPU transfers (and back) are rare. Even then, this is still far from really knowing how the code will perform.
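
A rough sketch of what I mean (C++; the array size, the dot-product loop as a stand-in for the hot code, and the ~141.7 GB/s GTX 280 peak figure are my assumptions):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Bandwidth-based speed-up guess: time a streaming loop on the CPU,
// then compare against a derated GPU memory bandwidth figure.
int main() {
    const size_t n = 1 << 26;                  // 64M doubles per array, 512 MB each
    std::vector<double> a(n, 1.0), b(n, 2.0);

    auto t0 = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)             // streaming read of both arrays
        sum += a[i] * b[i];
    auto t1 = std::chrono::steady_clock::now();

    double secs    = std::chrono::duration<double>(t1 - t0).count();
    double gbRead  = 2.0 * n * sizeof(double) / 1e9;  // bytes actually touched
    double cpuGBs  = gbRead / secs;

    const double gpuGBs = 0.75 * 141.7;  // ~75% of the GTX 280's quoted peak
    printf("CPU throughput: %.1f GB/s (sum=%f)\n", cpuGBs, sum);
    printf("naive bandwidth-bound speed-up guess: %.1fx\n", gpuGBs / cpuGBs);
    return 0;
}
```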

Could you please give me some examples?

P is NOT the number of processors! It is the speedup. For a cluster, the speedup would, I guess, equal the number of processors, because all the processors are identical, but that’s a special case. P is just “speedup.”

What speedup will you get on CUDA? Who knows. It depends mostly on your data access pattern (whether it can be mapped to the various on-die SRAMs). If you post the algorithm we could take a guess.

The fact that f is 1% is good news for you. It just means that whatever speedup you can get on your inner loop is essentially the speedup you’ll get on your whole app. But that’s all it means. Really, all these “laws” are kind of dumb. It is math for math’s sake.

I second tmurray’s, BIC_MAC’s and alex_dubinsky’s posts. Amdahl’s law simply does not apply here.

I’d even go further and claim that there is currently no way to do ad-hoc estimations of “potential speedups”. You can estimate worst cases for sure (being 100% BW bound, being 100% compute bound), but in 99% of practical cases the BW argument doesn’t count. The scheduling hardware is usually cleverer than any “law” you can come up with.

Amdahl’s law is still valid (and useful) if you look at the whole application.

Instead of using the serial and parallel splits, you could use the “run on the host” and “moved to the GPU” split to estimate a ballpark speed-up.

You can then refine the “moved to the GPU” part into “I/O” and “compute” parts.
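
A toy version of that bookkeeping (all the numbers here are invented, purely for illustration):

```cpp
#include <cstdio>

// Ballpark whole-application estimate: split the runtime into a host part
// and a GPU part, and refine the GPU part into I/O (transfers) and compute.
int main() {
    double hostSecs    = 0.10;  // fraction of original runtime kept on the CPU
    double gpuSecs     = 0.90;  // fraction of original runtime moved to the GPU
    double computeGain = 20.0;  // assumed kernel speedup over the CPU version
    double ioSecs      = 0.05;  // added transfer time, relative to original

    double newTotal = hostSecs + ioSecs + gpuSecs / computeGain;
    printf("estimated whole-app speed-up: %.2fx\n", 1.0 / newTotal);
    return 0;
}
```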

OK… what if in Amdahl’s law f = 0 or a small number (f being the fraction of the program which runs on the CPU)? Then the maximum speedup predicted by Amdahl’s law is P, and how much is P for a GPU? :blink:

Don’t you get it? All you’re asking is “how much faster will my algorithm run on CUDA?” To which the answer is: how the heck do we know? It depends. P is likely to be anything from as high as 200 to as low as 0.1 (or worse).

To briefly explain my application: during each time step of the simulation, I build the Jacobian matrix for the Newton-Raphson method. I have two codes: one is pure C++, run entirely on the CPU, and the other uses CUBLAS to do the matrix solve on the GPU via LU decomposition. In the second code I build the matrix on the CPU and then offload it to the GPU for the decomposition. By benchmarking the first code I found that more than 99% of each time step is spent on the matrix solve, which the second code performs on the GPU. I divided the total computation times of the two codes to find the speedup gained from using the GPU. This number depends entirely on the matrix size: the bigger the matrix, the larger the speedup. Up to here everything is OK!
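
For illustration, the structure of my second code looks roughly like this (a hedged sketch: it uses today’s cuSOLVER getrf/getrs routines rather than the CUBLAS-era calls I actually used, and the sizes are made up):

```cpp
#include <vector>
#include <cuda_runtime.h>
#include <cusolverDn.h>

// Sketch of the "build on CPU, solve on GPU" pattern described above:
// factor the Jacobian with LU on the device, solve, and copy the step back.
int main() {
    const int n = 1024, nrhs = 1;
    std::vector<double> A(n * n), b(n);
    // ... fill A with the Jacobian and b with the residual on the CPU ...

    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    double *dA, *dB, *dWork;
    int *dIpiv, *dInfo, lwork = 0;
    cudaMalloc(&dA, sizeof(double) * n * n);
    cudaMalloc(&dB, sizeof(double) * n);
    cudaMalloc(&dIpiv, sizeof(int) * n);
    cudaMalloc(&dInfo, sizeof(int));

    // Offload: copy the matrix and right-hand side to the GPU.
    cudaMemcpy(dA, A.data(), sizeof(double) * n * n, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), sizeof(double) * n, cudaMemcpyHostToDevice);

    // LU factorization (getrf) followed by the triangular solves (getrs).
    cusolverDnDgetrf_bufferSize(handle, n, n, dA, n, &lwork);
    cudaMalloc(&dWork, sizeof(double) * lwork);
    cusolverDnDgetrf(handle, n, n, dA, n, dWork, dIpiv, dInfo);
    cusolverDnDgetrs(handle, CUBLAS_OP_N, n, nrhs, dA, n, dIpiv, dB, n, dInfo);

    // Copy the Newton step back and continue the time step on the CPU.
    cudaMemcpy(b.data(), dB, sizeof(double) * n, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dWork);
    cudaFree(dIpiv); cudaFree(dInfo);
    cusolverDnDestroy(handle);
    return 0;
}
```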

Now I want to verify this speedup with Amdahl’s law, in which f is a small number. Therefore the maximum speedup predicted by the law is P! I want to know what the relation is between this P, the speedup I got, and the GPU I am using (GTX 280)!?

“predicted by the law is P! I want to know what the relation is between this P, the speedup”

QED

If f = 0, then Amdahl’s law doesn’t matter any more. You’ve obviated its reason for existing.

P.S. The reason larger matrices are faster is that they allow the optimizations CUBLAS uses to work more efficiently. Those optimizations have overheads, which are more noticeable for small problem sizes. (Plus, CUDA itself has overheads, which also show up more when the per-call work is small.)