Timers on GPU and CPU

I implemented a sorting routine on the GPU, and I used the following code to measure the elapsed time:

StopWatchInterface *hTimer = NULL;

..... do the sorting
printf("Time: %f ms\n", sdkGetTimerValue(&hTimer));

Then I implemented a serial version of the sort on the CPU. What is the best timer code in C to measure its elapsed time, so that the result is fairly comparable with the GPU timing above?

If you look at the functions in helper_timer.h in the SDK, you will find that they use
gettimeofday() (on Linux). In other words, sdkStartTimer(&hTimer) and sdkStopTimer(&hTimer)
are both ordinary host-side (serial) wall-clock timers. One thing to point out: do not
forget to synchronize after the sort, i.e. call cudaDeviceSynchronize() after the sorting
kernels and before stopping the timer. Kernel launches are asynchronous, so without the
synchronization the timer only measures the kernel launch time.

For a GPU-side timer, CUDA events (cudaEvent_t) are one option. This blog post might be of help: https://ivanlife.wordpress.com/2011/05/09/time-cuda/

In summary, to time the serial sorting code you can use a gettimeofday() timer as well, so that both measurements come from the same kind of clock.
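
As a rough illustration of the gettimeofday() approach, here is a minimal sketch in C. The names elapsed_ms, cmp_int and time_cpu_sort_ms are made up for this example, and qsort() is only a stand-in for whatever serial sort you actually wrote. It assumes a POSIX system.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/time.h>   /* gettimeofday(), struct timeval (POSIX) */

/* Milliseconds between two gettimeofday() readings (hypothetical helper). */
double elapsed_ms(struct timeval start, struct timeval stop)
{
    return (stop.tv_sec - start.tv_sec) * 1000.0
         + (stop.tv_usec - start.tv_usec) / 1000.0;
}

/* qsort() comparison callback: ascending ints. */
int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sort data[0..n) on the CPU and return the wall-clock time in ms.
 * qsort() is an assumption; replace it with your own serial sort. */
double time_cpu_sort_ms(int *data, size_t n)
{
    struct timeval start, stop;

    gettimeofday(&start, NULL);
    qsort(data, n, sizeof data[0], cmp_int);
    gettimeofday(&stop, NULL);

    /* Sanity check outside the timed region: the result must be sorted. */
    for (size_t i = 1; i < n; ++i)
        assert(data[i - 1] <= data[i]);

    return elapsed_ms(start, stop);
}
```

You would call time_cpu_sort_ms(data, n) from main() and print the result with the same printf("Time: %f ms\n", ...) format as the GPU version. Keep in mind that gettimeofday() reads the system wall clock, so for very short runs it is worth repeating the sort several times and averaging.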


nvprof also displays the percentage of time used by each function called. For example:

==5467== Profiling application: ./callkernel
==5467== Profiling result:
Time(%)      Time  Calls (host)  Calls (device)       Avg       Min       Max  Name
 83.35%  13.602ms             1               0  13.602ms  13.602ms  13.602ms  kernel(void)
 16.65%  2.7176ms             0             243  11.183us  10.176us  207.71us  kernel1(Nos*, int*, float*, int)

==5467== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 86.94%  1.09723s         1  1.09723s  1.09723s  1.09723s  cudaDeviceSetLimit
 11.69%  147.55ms         1  147.55ms  147.55ms  147.55ms  cudaDeviceReset
  1.10%  13.908ms         1  13.908ms  13.908ms  13.908ms  cudaDeviceSynchronize
  0.20%  2.5456ms         1  2.5456ms  2.5456ms  2.5456ms  cudaLaunch
  0.03%  398.51us        83  4.8010us     208ns  173.02us  cuDeviceGetAttribute
  0.02%  301.65us         1  301.65us  301.65us  301.65us  cuDeviceGetName
  0.00%  54.659us         1  54.659us  54.659us  54.659us  cuDeviceTotalMem
  0.00%  1.5250us         2     762ns     343ns  1.1820us  cuDeviceGetCount
  0.00%     994ns         1     994ns     994ns     994ns  cudaConfigureCall
  0.00%     598ns         2     299ns     205ns     393ns  cuDeviceGet