I am a bit new to CUDA and I want to measure the execution time of my CUDA program. Basically, I have to compare the performance of my program for calculating the sum of two 1000 x 1000 matrices, first on the GPU, and then I will use device emulation mode to compare the performance on the CPU. (Do you think device emulation mode can be used as a benchmark?)
So for this I need to know the execution time in each case. How do we find the execution time? What function and library are used, and where exactly do we put the calls?
For a CPU-side timer to work correctly, you will need to place cudaThreadSynchronize() (cudaDeviceSynchronize() in current CUDA releases) before stopping the timer, because kernel launches return before the kernel finishes. Alternatively, use CUDA events, as mentioned.
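A minimal sketch of the event-based approach (the matAdd kernel and the launch configuration are made up for illustration; error checking omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: element-wise sum of two 1000 x 1000 matrices,
// stored as flat arrays of n = 1000 * 1000 floats.
__global__ void matAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1000 * 1000;
    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));
    cudaMalloc(&dC, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                // enqueue start marker
    matAdd<<<(n + 255) / 256, 256>>>(dA, dB, dC, n);
    cudaEventRecord(stop, 0);                 // enqueue stop marker
    cudaEventSynchronize(stop);               // block until the kernel is done

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed milliseconds between events
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Because the events are recorded on the same stream as the kernel, cudaEventSynchronize(stop) replaces the explicit cudaThreadSynchronize() call.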
Also, no, device emulation cannot be used as a benchmark. It’s usually WAY slower than if you’d reimplemented it on the CPU.
Luckily, it doesn't have to be very hard. It's sometimes enough to copy the contents of your kernel into host code, wrap a for loop around it (perhaps two fors, one for blocks, the other for threads within a block) and then put a
#pragma omp parallel for
above the outer for if your compiler supports OpenMP. Remember to enable CPU vector intrinsics (SSE2, for example); the compiler should be smart enough to autovectorize at least parts of your host code. This can pass as a CPU benchmark, although there are cases where it's not that straightforward, for example if the kernel uses shared memory.
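For the matrix-sum case, the host-side port is about as simple as it gets (a sketch; names and sizes are illustrative):

```cuda
// Host-side port of a matrix-sum kernel: the block and thread indices
// simply become loop variables.
#include <cstdio>
#include <vector>

int main()
{
    const int rows = 1000, cols = 1000;
    std::vector<float> a(rows * cols, 1.0f), b(rows * cols, 2.0f), c(rows * cols);

    // Outer loop plays the role of blocks, inner loop of threads within a block.
    #pragma omp parallel for
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            c[i * cols + j] = a[i * cols + j] + b[i * cols + j];

    printf("c[0] = %f\n", c[0]);   // 3.000000
    return 0;
}
```

Without the OpenMP flag the pragma is simply ignored and the loop runs single-threaded, so the same source serves both cases.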
Read about MCUDA (http://www.gigascale.org/pubs/1278.html) to find out how to make efficient CPU code from CUDA kernels. They haven't yet released a compiler that does it for you, but they describe the methods in their paper.
Usually, you cannot use the emulator to compare CPU-only performance against CPU-GPU performance.
If you perform lots of summations of two 1000 x 1000 matrices A and B, and the matrix data are generated on the CPU, then you will spend most of your execution time moving data from host to device and back.
This is a good example showing that the ratio of data-transfer time to computation time is a very important factor, one that can completely negate the benefit of using the GPU.
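A back-of-envelope estimate makes the point (assuming single-precision elements and roughly 5 GB/s of effective PCIe bandwidth, both of which are assumptions, not measurements):

transfer: 3 matrices x 1000 x 1000 x 4 B = 12 MB, and 12 MB / 5 GB/s is about 2.4 ms
compute: 10^6 additions, which the GPU finishes in a small fraction of a millisecond

So for a single matrix sum, the transfers dominate the end-to-end time.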
Try the scalarProduct project from the SDK on the GPU, but move the start timer for the GPU calculation in front of the copy from host to device, and move the stop timer to after the data are copied back from device to host. You will see that the plain CPU execution time beats the combined CPU-GPU time. This is how benchmarking should be done for this example in the SDK, HEHE…
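Using CUDA events instead of the SDK's cutil timers, that timer placement looks like this (a sketch with a made-up matAdd kernel standing in for the SDK computation; error checking omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Made-up kernel standing in for the SDK computation.
__global__ void matAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1000 * 1000;
    const size_t bytes = n * sizeof(float);
    float *hA = new float[n](), *hB = new float[n](), *hC = new float[n]();
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    cudaEventRecord(start, 0);   // start BEFORE the host-to-device copies
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
    matAdd<<<(n + 255) / 256, 256>>>(dA, dB, dC, n);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);    // stop AFTER the device-to-host copy
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("end-to-end time (copies + kernel): %f ms\n", ms);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    delete[] hA; delete[] hB; delete[] hC;
    return 0;
}
```

Comparing this end-to-end figure with the kernel-only figure shows how much of the total is pure transfer cost.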