I have a single block of 30x34 threads that communicate through shared memory, protected by multiple __syncthreads() calls.
There appears to be some kind of race condition: identical inputs produce different answers from call to call (sometimes the correct one).
Basically the kernel just performs a complex matrix multiplication followed by division by the mean of the absolute values, repeated in a loop 10 times.
It appears that some threads randomly get ahead of others, despite the __syncthreads() barriers.
None of the __syncthreads() calls are inside the conditionals of the binary reduction.
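For reference, the loop structure is roughly the following (a simplified sketch, not my actual code; the kernel name, array names, and the padding of the reduction to a power of two are illustrative):

```cuda
#define ROWS 30
#define COLS 34
#define N    (ROWS * COLS)   // 1020 threads in the single block
#define NPAD 1024            // next power of two, for the binary reduction

__global__ void iterate(const float2 *M, float2 *v)
{
    __shared__ float2 work[ROWS][COLS];
    __shared__ float  partial[NPAD];

    int tid = threadIdx.y * COLS + threadIdx.x;
    work[threadIdx.y][threadIdx.x] = v[tid];
    if (tid < NPAD - N) partial[N + tid] = 0.0f;  // zero the padding once
    __syncthreads();

    for (int iter = 0; iter < 10; ++iter) {
        /* ... complex matrix multiplication writing into work ... */
        __syncthreads();   // every product written before anyone reads it

        // magnitude of each element, then a binary reduction for the sum
        partial[tid] = hypotf(work[threadIdx.y][threadIdx.x].x,
                              work[threadIdx.y][threadIdx.x].y);
        __syncthreads();
        for (int s = NPAD / 2; s > 0; s >>= 1) {
            if (tid < s)
                partial[tid] += partial[tid + s];
            __syncthreads();   // barrier kept outside the conditional
        }

        float mean = partial[0] / N;   // all threads read the same sum
        work[threadIdx.y][threadIdx.x].x /= mean;
        work[threadIdx.y][threadIdx.x].y /= mean;
        __syncthreads();   // normalization finished before next iteration
    }
    v[tid] = work[threadIdx.y][threadIdx.x];
}
```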
Does a block have to be confined to one warp for __syncthreads to work?