I was just wondering what the best way is to report speedups of a particular application. I have a highly optimized CUDA version and an unoptimized C++ version, and I can get around 250x–300x speedups vs. the unoptimized C++ version. However, I don’t really want to go back and clean up/optimize that code.
Would it be fair to measure the time it takes a single thread to execute on the GPU, then multiply that by the threads per block and the number of blocks to get the “serial” time? Basically, it would model the serial version as if it were executing on the GPU, instead of rewriting an optimized C++ version for the CPU. I’m curious what others have done, thanks!
Extrapolating from the performance of a single GPU thread is even more unfair than comparing against poorly optimized CPU code. The GPU is a very non-linear device, and performance does not scale from 1 thread to N threads in any easy-to-predict way.
There is no easy way to solve your problem. You can report the absolute performance of the GPU (in whatever units make sense for your problem), which is useful for others. Relative speedups are nice for demonstrating how awesome the GPU is, but unless there is a reasonably well-tuned CPU implementation to compare against, it is just misleading marketing.
Now, if the unoptimized CPU version is actively being used by some people, showing a relative speedup could be meaningful to some audiences (“Hey, you should stop using that and start using this!”), as long as you include the footnote that the CPU version could certainly be improved as well. It all depends on what statement you want to make.
A single thread/core on the GPU is far from equivalent to a modern CPU core in terms of performance, so no, I would say that isn’t a fair comparison at all. CPUs have higher clock rates, multiple functional units per core (modern CPUs are superscalar and out-of-order: they find independent instructions in the sequential stream and execute several of them in parallel internally), large caches (this is HUGE for any computation that is even remotely memory bound), etc. etc.
Also remember that not all of your threads and blocks are actually computing simultaneously. If you launched a grid of 100x100 blocks with 512 threads each, then by your logic you’d claim a speedup of over 5 million.