The clock example from the CUDA sample projects shows how to measure the execution time of a kernel. I wonder if there is a simpler solution.
I assume that something like:
#include <time.h>

int main(int argc, char** argv)
{
    clock_t c = clock();
    kernel<<<...>>>(...);
    clock_t clocks = clock() - c;   // elapsed CPU clocks around the launch
    return 0;
}
yields no useful results, since clock() outside a kernel measures CPU clocks, and the CPU stalls until the kernel has finished execution.
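For comparison, this is roughly the host-side variant I had in mind; I added an explicit cudaDeviceSynchronize() before the second clock() call just to be safe that the kernel has really finished, and the dummy kernel and launch configuration are only placeholders of mine:

#include <stdio.h>
#include <time.h>

// dummy kernel and launch configuration are placeholders, not my real code
__global__ void dummyKernel()
{
}

int main(void)
{
    clock_t c = clock();

    dummyKernel<<<1, 32>>>();
    cudaDeviceSynchronize();            // make sure the kernel has really finished

    clock_t clocks = clock() - c;       // elapsed CPU clocks including the kernel
    printf("elapsed CPU clocks: %ld\n", (long)clocks);
    return 0;
}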
In the clock example, only the thread with threadIdx.x == 0 calls the clock function. Is it therefore possible to measure the time in the following way (I use only one block but several threads)?
__global__ void kernel()
{
    clock_t c;
    if (threadIdx.x == 0) c = clock();
    // do some memory copying according to threadIdx.x
    __syncthreads();
    if (threadIdx.x == 0) c = clock() - c;
}
The variable c would then hold approximately the execution time of all threads, since I call __syncthreads() and only thread 0 does any work after the __syncthreads().
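Just to make sure I understand how the value would get back to the host, here is a sketch of what I have in mind, with the result written to global memory roughly the way the clock sample does it; the kernel name, the d_clocks pointer, and the launch configuration are names I made up, and the actual per-thread work is omitted:

#include <stdio.h>
#include <time.h>

// timedKernel and d_clocks are placeholder names; the per-thread work is omitted
__global__ void timedKernel(clock_t* d_clocks)
{
    clock_t c = 0;

    if (threadIdx.x == 0) c = clock();

    // ... memory copying according to threadIdx.x would go here ...

    __syncthreads();

    if (threadIdx.x == 0) *d_clocks = clock() - c;   // store result in global memory
}

int main(void)
{
    clock_t* d_clocks;
    cudaMalloc((void**)&d_clocks, sizeof(clock_t));

    timedKernel<<<1, 64>>>(d_clocks);

    clock_t h_clocks;
    cudaMemcpy(&h_clocks, d_clocks, sizeof(clock_t), cudaMemcpyDeviceToHost);
    printf("kernel took %ld GPU clocks\n", (long)h_clocks);

    cudaFree(d_clocks);
    return 0;
}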
Or is the example in the clock project already the easiest way?
Thanks
Sacha