timing kernel execution with clock()


I’m trying to time the execution of some kernels as accurately as possible on a C1060. I’ve used gettimeofday() and cudaEvents, which seem to be reasonably accurate, but I want to measure the execution of the kernels without any overhead of the kernel call, so I’ve turned to using clock() on the GPU.

This is my thinking so far: since I want to eliminate all overhead of the kernel launch, I need to call clock() on the GPU before and after the kernel's work. But __global__ functions must return void, so to get the tick count back to the CPU I pass the kernel a pointer to device memory and then do a cudaMemcpy of the result back to the host.

I have written the code below, and it seems like it should work, but the tick counts I am getting are obviously wrong. The variable I'm using to hold the tick count is of type clock_t. Is it possible that between reading the starting and ending tick counts the counter has wrapped around zero, since the tick count is so large?

This is the skeleton of my code:

[codebox]__global__ void my_kernel(clock_t *ticks)
{
    *ticks = clock();

    /* do some useful stuff */

    *ticks = clock() - *ticks;
}

int main()
{
    /* set up memory for timing */
    clock_t *h_ticks, *d_ticks;
    h_ticks = (clock_t *)malloc(sizeof(clock_t));
    cudaMalloc((void **)&d_ticks, sizeof(clock_t));

    /* call kernel and record time */
    my_kernel<<<grid, block>>>(d_ticks);
    cudaMemcpy(h_ticks, d_ticks, sizeof(clock_t), cudaMemcpyDeviceToHost);

    times[k] = (double)(*h_ticks / CLOCKS_PER_SEC);

    ...
}[/codebox]


The times I’m getting are widely varied and some are negative, which leads me to believe that the tick count is wrapping. Could someone please tell me what I’m doing wrong?


See the CUDA SDK clock sample. It has a list of all the caveats you need to be aware of. One obvious issue in the code you posted is that you have a race condition there with every thread reading and writing *ticks simultaneously.
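For reference, the pattern the SDK clock sample follows avoids that race by having a single thread per block record per-block counters into separate array slots. A minimal sketch of that idea (the kernel name and the work inside it are placeholders):

```cuda
/* Only thread 0 of each block writes, and each block writes its own slot,
   so no two threads touch the same location (sketch; timed_kernel and the
   work inside it are placeholders). */
__global__ void timed_kernel(clock_t *start, clock_t *stop)
{
    if (threadIdx.x == 0)
        start[blockIdx.x] = clock();
    __syncthreads();

    /* do some useful stuff */

    __syncthreads();
    if (threadIdx.x == 0)
        stop[blockIdx.x] = clock();
}
```

The host then copies both arrays back and takes, say, the maximum of stop[i] - start[i] over all blocks. Note also that clock() counts cycles per multiprocessor, so counters from different blocks are not directly comparable if the blocks ran on different SMs.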

Regardless of clock(): As far as I know, if you use a CUDA event to measure a single kernel execution, then you are measuring exactly the kernel execution with no overhead. To verify, you could run your code through the profiler and compare the gputime and cputime values of the kernel launches with the cuda event readings.
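A minimal sketch of event-based timing of a single launch (my_kernel, d_ticks, grid, and block stand in for whatever the surrounding code uses):

```cuda
/* Time one kernel launch with CUDA events. */
cudaEvent_t start, stop;
float ms = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
my_kernel<<<grid, block>>>(d_ticks);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   /* wait until the kernel has finished */

cudaEventElapsedTime(&ms, start, stop);   /* result is in milliseconds */

cudaEventDestroy(start);
cudaEventDestroy(stop);
```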

Thanks for your reply. I redid the timing portion of my code, basing it on the clock example from the SDK, and now I am getting much more realistic values. But I still have a concern: from what I have read on the forum, cudaEvent should be very accurate for timing code on the GPU, yet the times I am getting from it are several hundred times longer than those from clock(), and even quite a bit longer than those from gettimeofday().

Even though the clock() function gives me satisfactory accuracy, the cudaEvents would be much more convenient to use. Why does cudaEvent not seem to give me good results?

I think clock() returns values in seconds, but the cudaEvent* functions return time in milliseconds, which would make the result seem 100 times bigger.

Okay, thanks. That was a possibility I was hoping for, but I couldn’t find any information on it. Could you point me to a reference which confirms that?

Try “man clock”. It actually returns a number where CLOCKS_PER_SEC == one second. And the CUDA programming guide or reference manual will tell you that events measure in milliseconds.

Sorry, it would actually be 1000 times bigger.

For the cudaEvent one being in milliseconds, just look at the CUDA Reference Manual.

clock() isn’t mentioned in the CUDA Reference Manual. It’s most likely equivalent to the standard C clock() function, which (see the man page) returns ticks that you divide by CLOCKS_PER_SEC to get seconds.