I have a probably naive question: how useful is the computation time in emulation mode as an estimate of the execution time in “serial” mode? How well does the compiler optimize emulation-mode code?
It’s pretty far off, I think: emulation mode is single-threaded, and I believe some math operations are implemented quite differently on the GPU than on the CPU.
If you want a reliable speedup measurement and have nothing to compare against, you most likely won’t be able to avoid writing an optimized CPU version yourself.
I had the same problem with a calculation that previously existed only in MATLAB. After speeding it up from days to milliseconds, I felt that comparing against the MATLAB timing wasn’t fair, since MATLAB is optimized for accuracy rather than speed. So I had to hand-code the same thing, optimized for the CPU (using IPP and the like), which led to an inner conflict: I certainly wanted my GPU code to be faster, but I also wanted reliable numbers ;)
It was a lot of extra work, but it comes in handy when you need to defend GPU usage for your project. Moreover, we are now trying to implement a solution that can also run on machines without CUDA-capable GPUs. The code will run 10x to 30x slower, but it will run.