question about __syncthreads and bar.sync


My thread block size is 16x16. I tried two ways to synchronize it:

  1. use __syncthreads() to sync all (256) threads;


  2. use "bar.sync 1, 256" to sync 256 threads.

I saw a performance improvement with the second one. Can anyone tell me whether that is reasonable and how to understand it? Does bar.sync guarantee the consistency of shared memory?


Are you inlining the bar.sync with asm("bar.sync 1, 256;") or asm volatile("bar.sync 1, 256;")?
With the former (no volatile), the compiler is allowed to reorder instructions around the barrier, so it has more opportunities for (potentially incorrect) optimization than with __syncthreads().
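
For reference, here is a minimal sketch of how the inline barrier could be written so the compiler treats it conservatively. The names named_barrier_256, reverse_tile and count_reverse_mismatches are made up for illustration, and the single 16x16 block matches the setup in the question. The "memory" clobber is my addition on top of volatile: as far as I understand, volatile keeps the asm statement from being removed or moved, while the clobber also tells the compiler not to keep shared-memory values in registers across the barrier.

```cuda
#include <cassert>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // 16x16 block, as in the question

// Hypothetical helper: named barrier 1, waiting for 256 threads.
__device__ __forceinline__ void named_barrier_256() {
    // volatile: the asm statement must not be deleted or moved.
    // "memory": loads/stores must not be cached or moved across it.
    asm volatile("bar.sync 1, 256;" ::: "memory");
}

// Each thread writes one element to shared memory, then reads the
// element written by a different thread after the barrier.
__global__ void reverse_tile(const int* in, int* out) {
    __shared__ int tile[kThreads];
    int t = threadIdx.y * blockDim.x + threadIdx.x;
    tile[t] = in[t];
    named_barrier_256();
    out[t] = tile[kThreads - 1 - t];
}

// Host-side check: returns the number of wrong elements,
// or -1 if the kernel or a copy failed.
int count_reverse_mismatches() {
    int h_in[kThreads], h_out[kThreads];
    for (int i = 0; i < kThreads; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    cudaError_t err = cudaMalloc(&d_in, sizeof(h_in));
    assert(err == cudaSuccess);
    err = cudaMalloc(&d_out, sizeof(h_out));
    assert(err == cudaSuccess);

    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    reverse_tile<<<1, dim3(16, 16)>>>(d_in, d_out);
    err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < kThreads; ++i)
        if (h_out[i] != kThreads - 1 - i) ++mismatches;
    return mismatches;
}
```

Calling count_reverse_mismatches() from main and checking that it returns 0 is a quick sanity test. It cannot prove the plain asm("...") form is wrong, since whether the compiler actually reorders anything depends on the surrounding code, but it gives a baseline to compare timings and results against __syncthreads().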