I have a basic question regarding speedup calculation.
I have a serial application designed to run on a quad-core CPU.
The time taken by this serial application to execute on the quad-core CPU is t1.
Then, I parallelize this application using CUDA and run it on 512 GPU cores.
The time taken by this application to execute using 512 GPU cores is t2.
Now, I want to calculate the speedup of this CUDA parallelization.
My confusion is which of the following comparisons is the correct one:
a) We compare the timings for one CPU core vs. one GPU core.
b) We compare the timings for four CPU cores vs. 512 GPU cores. (In this case, the speedup would be t1/t2.)
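To make the arithmetic concrete, here is a minimal sketch of the ratio I mean for option b (the timing values are made up purely for illustration):

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical wall-clock times in seconds, purely for illustration */
    double t1 = 8.0;   /* serial application on the quad-core CPU */
    double t2 = 0.5;   /* CUDA version on the 512-core GPU        */

    /* Speedup = serial time / parallel time */
    printf("speedup = t1 / t2 = %.1fx\n", t1 / t2);
    return 0;
}
```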
Now suppose I have a parallel application in OpenMP running on four cores, completing in time t1, and a serial application in C running on the quad-core CPU but using only one of the four cores, completing in time t2.
Would the speedup t2/t1 then be 4?
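Something like the following sketch is what I have in mind, using omp_get_wtime() for the timings (the loop body is just a placeholder for the real work):

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

int main(void)
{
    const long n = 20000000;              /* arbitrary problem size */
    double *a = malloc(n * sizeof *a);
    if (a == NULL) return 1;

    /* Serial run: uses only one of the four cores */
    double start = omp_get_wtime();
    for (long i = 0; i < n; i++)
        a[i] = sqrt((double)i) * sin((double)i);
    double t2 = omp_get_wtime() - start;  /* serial time, as defined above */

    /* OpenMP run: the same loop spread across the four cores */
    start = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        a[i] = sqrt((double)i) * sin((double)i);
    double t1 = omp_get_wtime() - start;  /* parallel time, as defined above */

    printf("t2 (serial)   = %.3f s\n", t2);
    printf("t1 (parallel) = %.3f s\n", t1);
    printf("speedup t2/t1 = %.2f\n", t2 / t1);

    free(a);
    return 0;
}
```

(Built with something like gcc -O2 -fopenmp speedup.c -lm.)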
Allow me to introduce you to Amdahl's Law (http://en.wikipedia.org/wiki/Amdahl's_law). That, and the fact that OpenMP isn't particularly efficient at parallelizing an application, are two reasons why not.
Those are the software reasons. Hardware reasons could include memory bandwidth, or Intel Turbo Boost.
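To put a number on the Amdahl's Law point: if a fraction p of the serial runtime can be parallelized across n cores, the best possible speedup is 1 / ((1 - p) + p/n). A small sketch with a made-up parallel fraction:

```c
#include <stdio.h>

/* Amdahl's Law: upper bound on speedup when a fraction p of the
   serial runtime is parallelized across n cores. */
static double amdahl(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    /* 0.95 is a made-up parallel fraction, just for illustration */
    printf("p = 0.95,   4 cores: %5.2fx\n", amdahl(0.95, 4));    /* ~3.48x  */
    printf("p = 0.95, 512 cores: %5.2fx\n", amdahl(0.95, 512));  /* ~19.28x */
    printf("p = 1.00,   4 cores: %5.2fx\n", amdahl(1.00, 4));    /*  4.00x  */
    return 0;
}
```

Even a small serial fraction caps the speedup well below the core count.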
I’d say the speedup would be t2/t1.
Ken_g6 is citing some reasons why t2/t1 might not be 4 on a quad core, although I only partly agree on the OpenMP one: It very much depends on the specific case. For my scientific applications (which are not memory bound) I actually get very close to 4, like 3.8 or 3.9.
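A quick sanity check on that figure, rearranging Amdahl's Law to back out the parallel fraction implied by a measured speedup (a rough estimate that ignores the hardware effects mentioned above):

```c
#include <stdio.h>

int main(void)
{
    /* Parallel fraction implied by a measured speedup S on n cores,
       from rearranging Amdahl's Law: p = (1 - 1/S) / (1 - 1/n). */
    double S = 3.8, n = 4.0;
    double p = (1.0 - 1.0 / S) / (1.0 - 1.0 / n);
    printf("implied parallel fraction: %.3f\n", p);  /* ~0.982 */
    return 0;
}
```

So a speedup of 3.8 on four cores corresponds to roughly 98% of the runtime being parallel.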