My thread block size is 16x16. I tried two ways of synchronizing the block:
- use __syncthreads() to sync all (256) threads;
- use “bar.sync 1, 256” to sync 256 threads.
I saw a performance improvement with the second one. Can anyone tell me whether this is reasonable, and how to explain it? Does bar.sync guarantee the consistency of shared memory the way __syncthreads() does?
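
For reference, here is a minimal sketch of the two variants being compared. It is not my exact kernel: the tile transpose, the names transposeTile, barSync1 and countMismatches, and the single-block launch are all illustrative assumptions. The only parts taken from the setup above are the 16x16 block and the two barriers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Inline PTX named barrier: barrier ID 1, 256 participating threads.
// The "memory" clobber stops the compiler from moving loads/stores across it.
__device__ __forceinline__ void barSync1() {
    asm volatile("bar.sync 1, 256;" ::: "memory");
}

// Hypothetical kernel: transposes one 16x16 tile through shared memory.
template <bool UsePtx>
__global__ void transposeTile(const float* in, float* out) {
    __shared__ float tile[16][16];
    const int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * 16 + x];
    // Every thread must see the whole tile before reading another thread's element.
    if (UsePtx) barSync1(); else __syncthreads();
    out[y * 16 + x] = tile[x][y];
}

// Runs one variant on a single block and returns the number of wrong elements.
template <bool UsePtx>
int countMismatches() {
    const int n = 16 * 16;
    float hIn[n], hOut[n];
    for (int i = 0; i < n; ++i) hIn[i] = static_cast<float>(i);

    float *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);
    transposeTile<UsePtx><<<1, dim3(16, 16)>>>(dIn, dOut);
    cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    int bad = 0;
    for (int y = 0; y < 16; ++y)
        for (int x = 0; x < 16; ++x)
            if (hOut[y * 16 + x] != hIn[x * 16 + y]) ++bad;
    return bad;
}

int main() {
    printf("__syncthreads() mismatches: %d\n", countMismatches<false>());
    printf("bar.sync 1, 256 mismatches: %d\n", countMismatches<true>());
    return 0;
}
```

The sketch only checks that both barriers give the same result on a small case; it does not measure the timing difference I am asking about.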