understanding clock()

I tried running a kernel on one block of 32 threads with this code:


	clock1 = clock();
	__syncthreads();
	clock2 = clock();

	totalClocks = clock2 - clock1;

The result was 60 clocks, which seems strangely high given that the only code between the two clock() calls was a __syncthreads().

From the guide:

Can anyone explain where the 60 clock cycles come from?

Many thanks,


If you have more than a half-warp of threads in a block, some threads will have to wait for the others to reach the __syncthreads() barrier. The larger your block, the longer the wait. You may also be seeing register read-after-write latency (Section of the Programming Guide).


Using only one thread, I get the same result of 60 cycles, so the additional cycles can't all come from waiting for other threads. Without the __syncthreads() call I get 26 cycles, so they can't all come from read-after-write latency either. So does __syncthreads() take more than 4 cycles, or is something else going on here?
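For anyone who wants to reproduce the comparison, here is a minimal sketch that times an empty region and a region containing a single __syncthreads() in the same program, so the cost of the two clock() reads can be subtracted out. The names time_region, measure and elapsed_cycles are made up for this illustration, error checking is left out, and the counter is assumed to be read as a 32-bit value that can wrap around. The numbers will depend on the GPU and compiler version, so don't expect to see exactly 26 and 60.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: wraparound-safe difference between two counter reads.
// Unsigned subtraction gives the right answer even if the counter wrapped once.
__host__ __device__ inline unsigned int elapsed_cycles(unsigned int start,
                                                       unsigned int stop)
{
    return stop - start;
}

// Hypothetical kernel: times either an empty region or one __syncthreads().
// WithSync is a compile-time constant, so no run-time branch is measured.
template <bool WithSync>
__global__ void time_region(unsigned int *cycles)
{
    unsigned int start = (unsigned int)clock();
    if (WithSync) {
        __syncthreads();   // reached by every thread in the block
    }
    unsigned int stop = (unsigned int)clock();
    cycles[threadIdx.x] = elapsed_cycles(start, stop);
}

// Hypothetical host helper: launches one block and returns thread 0's count.
template <bool WithSync>
unsigned int measure(int threads)
{
    unsigned int *d_cycles = nullptr;
    unsigned int h_cycles = 0;
    cudaMalloc(&d_cycles, threads * sizeof(unsigned int));
    time_region<WithSync><<<1, threads>>>(d_cycles);
    // cudaMemcpy waits for the kernel on the default stream to finish.
    cudaMemcpy(&h_cycles, d_cycles, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_cycles);
    return h_cycles;
}

int main()
{
    unsigned int empty  = measure<false>(32);
    unsigned int synced = measure<true>(32);
    printf("empty region:       %u cycles\n", empty);
    printf("with __syncthreads: %u cycles\n", synced);
    printf("difference:         %d cycles\n", (int)(synced - empty));
    return 0;
}
```

Changing the argument to measure (1, 16, 32, 256, ...) shows how much of the difference depends on block size and how much is there even for a single thread.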