I’m trying to time the execution of some kernels as accurately as possible on a C1060. I’ve used gettimeofday() and CUDA events, which seem to be reasonably accurate, but I want to measure the kernels’ execution time without any kernel-launch overhead, so I’ve turned to using clock() on the GPU.
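For reference, this is roughly how I’ve been timing with CUDA events (just a minimal, self-contained sketch; dummy_kernel and the launch configuration are placeholders for my real kernel):

[codebox]#include <stdio.h>
#include <cuda_runtime.h>

__global__ void dummy_kernel(void)
{
    /* do some useful stuff */
}

int main(void)
{
    cudaEvent_t start, stop;
    float ms;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    dummy_kernel<<<1, 256>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);               /* wait until the kernel has finished */

    cudaEventElapsedTime(&ms, start, stop);   /* elapsed time in milliseconds */
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}[/codebox]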
This is my thinking so far: since I want to eliminate all of the launch overhead, I need to call clock() on the GPU at the start and end of the kernel. But __global__ functions must return void, so in order to pass the number of clock ticks back to the CPU, I need to pass the kernel a pointer and then do a cudaMemcpy to get the value back to a host address.
I have written something that seems like it should (or at least could) work, but the tick counts I am getting are obviously wrong. The variable I’m using to hold the tick count is of type clock_t. Is it possible that between the starting and ending readings the counter has wrapped around zero, since the tick count is so large?
This is the skeleton of my code:
[codebox]__global__ void my_kernel(clock_t *ticks)
{
    *ticks = clock();
    /* do some useful stuff */
    *ticks = clock() - *ticks;
}

/* setup memory for timing */
h_ticks = (clock_t*)malloc(sizeof(clock_t));

/* call kernel and record time */
times[k] = (double)(*h_ticks/CLOCKS_PER_SEC);[/codebox]
The times I’m getting vary widely and some are negative, which leads me to believe that the tick count is wrapping. Could someone please tell me what I’m doing wrong?