I tried running a kernel on one block of 32 threads with this code:
__syncthreads();
clock1 = clock();
__syncthreads();
clock2 = clock();
__syncthreads();
totalClocks = clock2 - clock1;
The result was 60 clocks, which seems strangely high given that the only code between the two clock() calls was a single __syncthreads().
From the guide:
Can anyone explain where the 60 clock cycles come from?
Many thanks,
Alex
If you have more than a half-warp of threads in a block, some threads will have to wait for others to sync. The larger your block, the longer the wait. You may also be experiencing register read-after-write latency (Section 5.1.2.5 of the Programming Guide).
Paulius
Using only one thread, I get the same result of 60 cycles, so the additional cycles aren’t all from waiting for other threads. Without the __syncthreads() call I get 26 cycles, so read-after-write latency and the clock() reads themselves explain at most 26 of the 60. That leaves roughly 34 cycles (60 - 26) attributable to __syncthreads() itself. So does __syncthreads() take more than 4 cycles, or is something else going on here?
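For anyone who wants to repeat the comparison, here is a minimal sketch of how the two measurements could be taken in a single kernel launch. The names timeSync, measureSync and SyncTiming are made up for illustration and are not from the original code. It assumes one block, with thread 0 reporting its own counts, and the compiler is free to schedule instructions around the clock() reads, so the numbers should be treated as rough.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not the original poster's code): measures the cycles
// between two clock() reads, first back-to-back, then with a __syncthreads()
// in between. Unsigned subtraction tolerates a wrap of the 32-bit counter.
__global__ void timeSync(unsigned int *out)
{
    __syncthreads();
    unsigned int a0 = (unsigned int)clock();
    unsigned int a1 = (unsigned int)clock();   // baseline: no barrier in between
    __syncthreads();
    unsigned int b0 = (unsigned int)clock();
    __syncthreads();                            // the barrier being timed
    unsigned int b1 = (unsigned int)clock();
    __syncthreads();

    if (threadIdx.x == 0) {                     // report thread 0's view only
        out[0] = a1 - a0;
        out[1] = b1 - b0;
    }
}

// Hypothetical result type for this sketch.
struct SyncTiming {
    unsigned int baseline;   // cycles between two adjacent clock() reads
    unsigned int withSync;   // cycles with one __syncthreads() in between
    bool ok;                 // false if any CUDA call failed
};

// Launches one block of threadsPerBlock threads and copies the counts back.
SyncTiming measureSync(int threadsPerBlock)
{
    SyncTiming t = {0u, 0u, false};
    unsigned int hOut[2] = {0u, 0u};
    unsigned int *dOut = nullptr;

    if (cudaMalloc(&dOut, sizeof(hOut)) != cudaSuccess) return t;

    timeSync<<<1, threadsPerBlock>>>(dOut);
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t copyErr = cudaMemcpy(hOut, dOut, sizeof(hOut),
                                     cudaMemcpyDeviceToHost);
    cudaFree(dOut);

    if (launchErr != cudaSuccess || copyErr != cudaSuccess) return t;

    t.baseline = hOut[0];
    t.withSync = hOut[1];
    t.ok = true;
    return t;
}

int main()
{
    const int sizes[] = {1, 32};                // one thread, then one full warp
    for (int i = 0; i < 2; ++i) {
        SyncTiming t = measureSync(sizes[i]);
        if (!t.ok) {
            fprintf(stderr, "CUDA call failed for %d threads\n", sizes[i]);
            return 1;
        }
        printf("%2d threads: baseline %u cycles, with __syncthreads() %u cycles\n",
               sizes[i], t.baseline, t.withSync);
    }
    return 0;
}
```

Subtracting baseline from withSync gives an estimate of the barrier cost on its own, which is the same 60 - 26 comparison described above. The actual values will depend on the GPU and compiler version.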