I was just wondering what the best way is to report speedups for a particular application. I have a highly optimized CUDA version and an unoptimized C++ version, and I see speedups of around 250x to 300x over the unoptimized C++ code. However, I don’t really want to go back and clean up/optimize that C++ code.
Would it be fair to measure the time it takes a single thread to execute on the GPU, then multiply that by the threads per block and the number of blocks to get the “serial” version? Basically, it would be as if the serial version were executing on the GPU, instead of rewriting an optimized C++ model for the CPU. A rough sketch of what I mean is below. I’m curious what others have done, thanks!
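Here is a minimal sketch of the measurement I have in mind, timing a single-thread launch with CUDA events and extrapolating; the kernel (`myKernel`), its per-thread work, and the launch configuration are just placeholders standing in for my real code:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: each thread does the same fixed amount of work.
__global__ void myKernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];  // stand-in per-thread work
}

int main() {
    const int threadsPerBlock = 256;   // placeholder launch config
    const int numBlocks = 1024;
    const int n = threadsPerBlock * numBlocks;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time one thread's worth of work with a <<<1,1>>> launch.
    // Caveat: this also includes kernel launch overhead, which can
    // dominate if the per-thread work is small.
    cudaEventRecord(start);
    myKernel<<<1, 1>>>(d_in, d_out, 1);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float msSingle = 0.0f;
    cudaEventElapsedTime(&msSingle, start, stop);

    // Extrapolate: as if every thread's work ran back-to-back
    // on a single GPU thread.
    float msSerialEstimate = msSingle * threadsPerBlock * numBlocks;
    printf("single-thread time: %f ms, extrapolated serial: %f ms\n",
           msSingle, msSerialEstimate);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```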