How to Calculate Speedup with CUDA


I’m working with an NVIDIA GeForce GT 750M (Kepler, 384 CUDA cores, so 192 CUDA cores per SMX).
To calculate the speedup and the efficiency, we need to compare, for example:

1. sequential_time(CPU) / parallel_time(GPU) with N cores.
2. sequential_time(CPU) / parallel_time(GPU) with N+1 cores.
3. sequential_time(CPU) / parallel_time(GPU) with N+2 cores, and so on.

and generate a chart of speedup versus the number of cores.

So, how can I calculate the speedup of my code with a varying number of cores in CUDA? Is this possible?

Thanks a lot!!

It’s not a trivial matter to scale CUDA code execution across a subset of the SMs or cores provided by a GPU, as it is with OpenMP threads on a multicore CPU.

It could be done, but the required code modifications would be rather extreme, and the results would not be indicative of what you should expect on another GPU that actually has that many cores/SMs.


What would be the correct way to calculate speedup and efficiency in CUDA?

Speedup compared to what? A CPU implementation? Divide the end-to-end run time of the application on the reference platform by the end-to-end run time on the GPU. That’s the speedup people care about in practice.

If you want to get an idea about scaling, use different GPU models with different numbers of SMs and/or multiple GPUs. Ideally the GPUs would all be from the same architecture, as the microarchitecture varies considerably across generations.

An excellent example can be seen in a recent paper on the HPCG benchmark by E. Phillips and M. Fatica. They scaled from a small embedded system all the way to supercomputers using various Kepler-family GPUs: