Hello,
My thread block size is 16x16. I tried to:
- use __syncthreads() to sync all (256) threads;
or
- use “bar.sync 1, 256” to sync 256 threads.
I saw a performance improvement with the second one. Can anyone tell me whether this is reasonable and how to understand it? Does bar.sync guarantee the consistency of shared memory?
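To make the comparison concrete, here is a minimal sketch of the two variants. It is not my real kernel: the names reverse_syncthreads, reverse_named_barrier and run_variant, and the "reverse a 256-element tile through shared memory" workload, are made up for illustration. The inline PTX form with the "memory" clobber is how I assume the bar.sync would normally be written from CUDA C.

```cuda
#include <cuda_runtime.h>

#define TILE 16                  // 16x16 thread block
#define NTHREADS (TILE * TILE)   // 256 threads

// Variant 1: the standard block-wide barrier.
__global__ void reverse_syncthreads(const int* in, int* out) {
    __shared__ int tile[NTHREADS];
    int t = threadIdx.y * TILE + threadIdx.x;
    tile[t] = in[t];
    __syncthreads();                        // all 256 threads wait here
    out[t] = tile[NTHREADS - 1 - t];        // read a value written by another thread
}

// Variant 2: named barrier 1 with an explicit thread count, via inline PTX.
__global__ void reverse_named_barrier(const int* in, int* out) {
    __shared__ int tile[NTHREADS];
    int t = threadIdx.y * TILE + threadIdx.x;
    tile[t] = in[t];
    // The "memory" clobber asks the compiler not to move shared-memory
    // accesses across the asm statement (assumed to be needed here).
    asm volatile("bar.sync 1, 256;" ::: "memory");
    out[t] = tile[NTHREADS - 1 - t];
}

// Hypothetical host helper: runs one variant on the values 0..255 in a single
// block and returns true if the output is the input reversed.
bool run_variant(bool use_named_barrier) {
    int h_in[NTHREADS], h_out[NTHREADS];
    for (int i = 0; i < NTHREADS; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    if (use_named_barrier) {
        reverse_named_barrier<<<1, block>>>(d_in, d_out);
    } else {
        reverse_syncthreads<<<1, block>>>(d_in, d_out);
    }
    cudaError_t launch_err = cudaGetLastError();

    // The blocking copy also waits for the kernel to finish.
    cudaError_t copy_err =
        cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (launch_err != cudaSuccess || copy_err != cudaSuccess) return false;

    for (int i = 0; i < NTHREADS; ++i) {
        if (h_out[i] != NTHREADS - 1 - i) return false;
    }
    return true;
}
```

Calling run_variant(false) and run_variant(true) from main checks each variant for a single block. It only tests that the results match, it does not measure the timing difference I am asking about.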
Thanks,
Susan