any theory to predict the speedup

Is there any theory we can use to predict the speedup if we port a CPU program to a CPU+GPU platform?

Thank you,

Amdahl’s Law?

Yes. For a particular algorithm, suppose the part you are porting to the GPU is memory-bandwidth bound. Count up the total number of global memory reads and writes you need to execute the algorithm, then divide by the theoretical global memory bandwidth of the device to calculate a lower bound on the running time.
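The bandwidth-based lower bound described above can be sketched as a quick back-of-envelope calculation. The array sizes and the 140 GB/s bandwidth figure below are illustrative assumptions, not the specs of any particular device:

```python
# Sketch of the bandwidth-bound lower-bound estimate: total bytes moved
# through global memory divided by the device's peak bandwidth.

def bandwidth_lower_bound(bytes_read, bytes_written, bandwidth_gb_s):
    """Lower bound on kernel time (seconds), assuming memory bandwidth
    is the only limit."""
    total_bytes = bytes_read + bytes_written
    return total_bytes / (bandwidth_gb_s * 1e9)

# Example: a kernel streams two input arrays of 1M floats each and
# writes one output array of 1M floats, on a device with an assumed
# peak global memory bandwidth of 140 GB/s.
n = 1_000_000
t = bandwidth_lower_bound(bytes_read=2 * 4 * n, bytes_written=4 * n,
                          bandwidth_gb_s=140.0)
print(f"lower bound: {t * 1e6:.1f} microseconds")
```

Real kernels rarely hit peak bandwidth (uncoalesced accesses, partial occupancy), so the actual running time will sit above this bound.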

For speedups of your entire application, use Amdahl’s law like tmurray suggested.

There are lots of ways this approximation can be wrong, but it is a reasonable first-order estimate, under the assumption that your kernel is memory-bandwidth bound.
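For the whole-application estimate, Amdahl's law itself is a one-line formula. The 80%/20x numbers below are made-up inputs just to show how quickly the serial fraction caps the overall speedup:

```python
# Amdahl's law: if a fraction p of the runtime is parallelizable and
# that part is sped up by a factor s, the overall speedup is
#   1 / ((1 - p) + p / s)

def amdahl_speedup(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# Example (assumed numbers): 80% of the runtime moves to the GPU and
# runs 20x faster there. The whole application speeds up by only
# about 4.2x, because the remaining 20% stays serial on the CPU.
print(amdahl_speedup(p=0.8, s=20.0))
```

Note that even with an infinitely fast GPU (s → ∞), the speedup here would be capped at 1/(1 − p) = 5x, which is why the serial fraction matters more than the kernel speedup once the kernel is fast.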

Thank you for the information.

I think Amdahl's or Gustafson's law needs input from the actual implementation, so it can perhaps give only a very rough estimate. Has anybody used these laws, or other theories, to predict the speedup bound before implementing?

What is meant by ‘memory bandwidth bound’?

Besides the global memory read/write counts, do we also need to count the actual operations, such as +, -, *, /, and add up their cycle counts to find the theoretical running time?

Technically yes, but practically no. Global memory reads and writes have a relatively high latency (a few hundred cycles), whereas each basic arithmetic operation takes only about 4 cycles. So if you're doing a lot of global memory accesses, that is going to be your performance bottleneck, which, as MrAnderson wrote, can be used to give you a rough lower bound on the running time.
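The claim above can be checked with a rough cycle count. The latency figures are assumed round numbers in the spirit of the discussion (a few hundred cycles per global memory access, ~4 cycles per basic op), and the sketch ignores latency hiding by the thread scheduler:

```python
# Rough comparison of memory vs. arithmetic cost per element, using
# assumed cycle counts; this ignores latency hiding, so it only shows
# why memory traffic usually dominates, not exact timings.

MEM_LATENCY_CYCLES = 400   # assumed global memory access latency
OP_CYCLES = 4              # assumed cost of one basic +, -, * op

def cycle_estimate(num_mem_accesses, num_ops):
    """Return (memory cycles, compute cycles) for one element."""
    return (num_mem_accesses * MEM_LATENCY_CYCLES,
            num_ops * OP_CYCLES)

# Example: a kernel with 3 global accesses and 10 arithmetic ops per
# element. Memory cost dwarfs compute cost (1200 vs. 40 cycles), so
# counting the arithmetic barely changes the estimate.
mem, compute = cycle_estimate(num_mem_accesses=3, num_ops=10)
print(mem, compute)
```

In practice the GPU overlaps memory latency with computation from other threads, which is why the bandwidth-based bound from earlier in the thread is the more useful model than raw latency counting.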