I tried running a kernel on one block of 32 threads with this code:
__syncthreads();
clock1 = clock();
__syncthreads();
clock2 = clock();
__syncthreads();
totalClocks = clock2 - clock1;
The result was 60 clocks, which seems strangely high given that the only code between the two clock() calls was a single __syncthreads().
From the guide:
Can anyone explain where the 60 clock cycles come from?
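In case anyone wants to reproduce this, here is a minimal self-contained sketch of the measurement. The kernel name `timeSync`, the helper `measure`, and the output buffer are hypothetical names made up for illustration; they are not from my original code. It also times two back-to-back clock() reads, on the assumption that reading the counter has some cost of its own, so that cost can be compared with the __syncthreads() figure. The compiler may schedule instructions differently from the source order, so the numbers should be treated as rough.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: thread 0 records two cycle deltas.
// out[0]: two back-to-back clock() reads (cost of the timing itself).
// out[1]: two clock() reads with one __syncthreads() in between.
__global__ void timeSync(unsigned int *out)
{
    __syncthreads();
    unsigned int a = (unsigned int)clock();
    unsigned int b = (unsigned int)clock();

    __syncthreads();
    unsigned int c = (unsigned int)clock();
    __syncthreads();
    unsigned int d = (unsigned int)clock();

    if (threadIdx.x == 0) {
        // Unsigned subtraction stays correct if the counter wraps around.
        out[0] = b - a;
        out[1] = d - c;
    }
}

// Launches one block of 32 threads and copies the two deltas to the host.
cudaError_t measure(unsigned int result[2])
{
    unsigned int *dOut = nullptr;
    cudaError_t err = cudaMalloc(&dOut, 2 * sizeof(unsigned int));
    if (err != cudaSuccess) return err;

    timeSync<<<1, 32>>>(dOut);
    err = cudaGetLastError();
    if (err == cudaSuccess) {
        // cudaMemcpy waits for the kernel to finish before copying.
        err = cudaMemcpy(result, dOut, 2 * sizeof(unsigned int),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(dOut);
    return err;
}

int main()
{
    unsigned int r[2] = {0, 0};
    cudaError_t err = measure(r);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("back-to-back clock(): %u cycles\n", r[0]);
    printf("clock() around __syncthreads(): %u cycles\n", r[1]);
    return 0;
}
```

Subtracting the first number from the second should give a rough idea of how much of the total is the barrier itself rather than the clock() reads.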