Measuring the time of a bunch of kernels: is there a simpler way?

The clock example from the CUDA sample projects shows how the time for executing a kernel is measured. I wonder if there is a simpler solution.

I assume that something like:


main(int argc, char** argv)
{
    c = clock();
    kernel<<<grid, block>>>(...);
    clocks = clock() - c;
}


yields no useful results, as the clock() function outside a kernel refers to CPU clocks and the CPU stalls until the kernel has finished execution.
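One detail worth noting: kernel launches are asynchronous, so without an explicit synchronization the second clock() call would measure little more than the launch overhead. A minimal host-side sketch with synchronization added (the dummy kernel, launch configuration, and buffer size are placeholders of my own):

```cuda
#include <cstdio>
#include <ctime>

__global__ void kernel(int* data)
{
    // placeholder work so the kernel takes measurable time
    data[threadIdx.x] = threadIdx.x;
}

int main()
{
    int* d_data;
    cudaMalloc(&d_data, 32 * sizeof(int));

    clock_t c = clock();
    kernel<<<1, 32>>>(d_data);
    cudaDeviceSynchronize();   // launches are asynchronous: without this,
                               // we would time only the launch, not the kernel
    clock_t clocks = clock() - c;

    printf("elapsed: %.3f ms\n", 1000.0 * clocks / CLOCKS_PER_SEC);
    cudaFree(d_data);
    return 0;
}
```

Even with the synchronization, the host clock() has coarse resolution, so this only makes sense for kernels that run for many milliseconds.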

In the clock example, only the thread with threadId 0 calls the clock() function. Is it therefore possible to measure time in the following way (I use only one block but several threads):

kernel() {

    if (threadIdx.x == 0) c = clock();

    // each thread copies some memory according to its threadIdx

    __syncthreads();

    if (threadIdx.x == 0) c = clock() - c;
}


The variable c would then hold approximately the execution time of all threads, since I call __syncthreads() and only thread 0 does its calculation after the __syncthreads().

Or is the example in the clocks project already the easiest way?



It is not only possible to call clock() in every thread - I strongly recommend doing so! The reason is that clock() returns a time stamp, so the differences you compute are wall-clock time, i.e. they include any stalls the thread experiences. To make sure you get a consistent view of what's going on, you really need to look at each thread individually.

For the purpose of getting only the maximum block execution time, you can use the code you showed above, but you need to add a __syncthreads() before the first clock() call to make sure that all threads have a common start time. Note, however, that each multiprocessor has only 8 ALUs, so for any block size larger than half a warp you won't get close to the real per-thread time.
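Following that recommendation, a per-thread variant might look like the sketch below, loosely modeled on the SDK clock sample; the timer-array layout and the doubling "work" are placeholder assumptions of mine:

```cuda
// Each thread records its own start and stop time stamps into global memory.
__global__ void timedKernel(clock_t* timer, int* data)
{
    int tid = threadIdx.x;

    __syncthreads();              // common start point for the whole block
    clock_t start = clock();

    data[tid] *= 2;               // stand-in for the real per-thread work

    clock_t stop = clock();
    timer[tid] = start;                // all start stamps first,
    timer[blockDim.x + tid] = stop;    // then all stop stamps
}
```

After copying timer back to the host, timer[blockDim.x + tid] - timer[tid] gives each thread's wall-clock cycle count, stalls included; the maximum over all threads of the block approximates the block execution time.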