I’m just trying to understand this a little better. Obviously, no synchronization means things run as fast as possible, but reading the CUDA and PTX documentation, it seems like there is a bigger hit than just a thread execution unit sitting idle for a while. Am I reading this right? Has anyone benchmarked it? Is there really that much overhead?
I think I read that __syncthreads() itself takes about 6 cycles, so the cost is pretty low as long as all threads are doing roughly the same amount of work.
Thread synch is often associated with waiting for data going to/from global memory; in those cases, if you have more than one block per multiprocessor, the transfer latency may be hidden because another block can run while the first one waits.
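A common shape of that pattern is staging a tile of global memory into shared memory and synchronizing before using it. A minimal sketch (the kernel and its names are illustrative, not from the original posts):

```cuda
// Each block stages a tile of global memory into shared memory,
// then synchronizes before any thread reads its neighbours' loads.
__global__ void scale_tile(const float *in, float *out, float factor, int n)
{
    __shared__ float tile[256];          // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];       // global -> shared (high latency)

    __syncthreads();                     // wait until the whole tile is loaded

    // While this block stalls at the barrier, the scheduler can run warps
    // from another resident block on the same multiprocessor, hiding the
    // memory latency -- provided occupancy allows more than one block.
    if (i < n)
        out[i] = tile[threadIdx.x] * factor;
}
```

With only one block resident per multiprocessor, the barrier stall is paid in full; with two or more, it largely overlaps with other blocks' work.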
Another case I can think of is where some threads execute one branch of code (a long path) and others a shorter one, and the __syncthreads() comes where their paths rejoin. In that case it should help if all the threads in most warps take the same-‘length’ path (i.e. either all long or all short), though sometimes it isn’t worth the effort of arranging that.
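That divergent-path case might look like the following sketch (kernel and names are hypothetical): threads take a long or short path depending on their data, then meet at the barrier.

```cuda
// Threads take a long or short code path depending on their data,
// then all of them meet at __syncthreads().
__global__ void uneven_paths(const float *in, float *out, int n)
{
    __shared__ float partial[256];        // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    if (v > 0.0f) {
        // Long path: extra iterative work for "positive" elements.
        for (int k = 0; k < 100; ++k)
            v = v * 0.999f + 0.001f;
    }                                     // short path: nothing extra to do

    partial[threadIdx.x] = v;
    __syncthreads();                      // every warp waits for the slowest one

    // If a warp mixes long- and short-path threads, the short ones idle
    // inside their own warp, and then the whole block waits at the barrier
    // for the slowest warp. Arranging the data so each warp is uniformly
    // long or uniformly short reduces both costs.
    if (i < n)
        out[i] = partial[threadIdx.x];
}
```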