I was just wondering what the best way is to report speedups for a particular application. I have a highly optimized CUDA version and an unoptimized C++ version, and I see speedups of around 250x to 300x over the unoptimized C++ code. However, I don’t really want to go back and clean up/optimize that C++ code.
Would it be fair to measure the time it takes a single thread to execute on the GPU, then multiply that by the threads per block and the number of blocks to get the “serial” version? Basically, it would be as if the serial version were executing on the GPU, instead of rewriting an optimized C++ model for the CPU. A rough sketch of what I mean is below. I’m curious what others have done, thanks!
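Here is a minimal sketch of the measurement I have in mind, timing a single-thread launch with CUDA events and extrapolating; the kernel (`myKernel`), its per-thread work, and the launch configuration are just placeholders standing in for my real code:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: each thread does the same fixed amount of work.
__global__ void myKernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];  // stand-in per-thread work
}

int main() {
    const int threadsPerBlock = 256;   // placeholder launch config
    const int numBlocks = 1024;
    const int n = threadsPerBlock * numBlocks;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time one thread's worth of work with a <<<1,1>>> launch.
    // Caveat: this also includes kernel launch overhead, which can
    // dominate if the per-thread work is small.
    cudaEventRecord(start);
    myKernel<<<1, 1>>>(d_in, d_out, 1);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float msSingle = 0.0f;
    cudaEventElapsedTime(&msSingle, start, stop);

    // Extrapolate: as if every thread's work ran back-to-back
    // on a single GPU thread.
    float msSerialEstimate = msSingle * threadsPerBlock * numBlocks;
    printf("single-thread time: %f ms, extrapolated serial: %f ms\n",
           msSingle, msSerialEstimate);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```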