Understanding clock()

I tried running a kernel on one block of 32 threads with this code:

    __syncthreads();
    clock1 = clock();
    __syncthreads();
    clock2 = clock();
    __syncthreads();
    totalClocks = clock2 - clock1;

The result was 60 clocks, which seems strangely high given that the only code between the two clock() calls was a single __syncthreads().

From the guide:

Can anyone explain where the 60 clock cycles come from?

Many thanks,

Alex

If you have more than a half-warp of threads in a block, some threads will have to wait for others to sync. The larger your block, the longer the wait. You may also be experiencing register read-after-write latency (Section 5.1.2.5 of the Programming Guide).

Paulius

Using only one thread, I get the same result of 60 cycles, so the additional cycles can't all come from waiting for other threads. Without the __syncthreads() call I get 26 cycles, so they can't all come from read-after-write latency either. That leaves 34 cycles (60 - 26) that appear to belong to the barrier itself. So does __syncthreads() take more than 4 cycles, or is something else going on here?
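For anyone who wants to reproduce the comparison above, here is a minimal sketch of a complete test program. The kernel name timeSync, the out buffer and the host code are my own assumptions, not the code from the original post. It measures two back-to-back clock() reads as a baseline, then two reads with a __syncthreads() in between, for a block of 1 thread and a block of 32 threads. The numbers depend on the GPU and on how the compiler schedules the clock reads, so treat them as rough indications only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (name and output layout are assumptions for this sketch).
// out[0] = cycles between two back-to-back clock() reads (baseline)
// out[1] = cycles between two clock() reads separated by __syncthreads()
__global__ void timeSync(unsigned int *out)
{
    __syncthreads();               // line all threads up before measuring

    clock_t a0 = clock();
    clock_t a1 = clock();          // baseline: nothing between the two reads

    __syncthreads();

    clock_t b0 = clock();
    __syncthreads();               // the barrier being timed
    clock_t b1 = clock();

    // Only thread 0 reports, so the result reflects a single thread's view.
    if (threadIdx.x == 0) {
        out[0] = (unsigned int)(a1 - a0);
        out[1] = (unsigned int)(b1 - b0);
    }
}

int main()
{
    unsigned int *dOut = nullptr;
    unsigned int hOut[2] = {0, 0};

    if (cudaMalloc(&dOut, sizeof(hOut)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    const int blockSizes[2] = {1, 32};   // one thread, then one full warp
    for (int i = 0; i < 2; ++i) {
        timeSync<<<1, blockSizes[i]>>>(dOut);

        // cudaMemcpy waits for the kernel to finish and reports any launch error.
        cudaError_t err = cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
        if (err != cudaSuccess) {
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
            cudaFree(dOut);
            return 1;
        }

        printf("threads=%2d  baseline=%u cycles  with __syncthreads=%u cycles\n",
               blockSizes[i], hOut[0], hOut[1]);
    }

    cudaFree(dOut);
    return 0;
}
```

Subtracting the baseline from the second figure gives a rough estimate of what the barrier costs on top of the clock() reads themselves. It is worth checking the generated PTX or SASS to confirm that the compiler kept the clock reads in the order written.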