Simply removing all __syncthreads() from the code sounds like a voodoo debugging technique, similar to waving a rubber chicken over your monitor.
Try reducing the problem size and adding device-side printf() to the code. I have found any number of bugs in kernels with nothing more than a simple log produced with printf(). I would suggest starting with just a couple of printf() calls, ideally restricted to a single thread or block, to avoid overwhelming the ring buffer that device-side printf() uses to communicate with the host.
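To make that concrete, here is a minimal sketch of what I mean. The kernel (reduce_debug), the helper run_reduce_debug(), and the size N = 8 are all made up for illustration and are not taken from your code; the only point is the pattern: shrink the problem to one tiny block and let a single thread print one line per step.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Deliberately tiny problem size (hypothetical, for illustration only).
// Must be a power of two for the simple reduction loop below.
constexpr int N = 8;

// Hypothetical example kernel: block-wide sum reduction in shared memory.
__global__ void reduce_debug(const float* in, float* out, int n)
{
    __shared__ float s[N];
    int tid = threadIdx.x;

    s[tid] = (tid < n) ? in[tid] : 0.0f;
    __syncthreads();  // all loads must land before any thread reads s[]

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            s[tid] += s[tid + stride];
        }
        __syncthreads();  // finish this step before the next one starts

        // Only one thread prints, and only one line per step, so the
        // printf() buffer is not flooded.
        if (tid == 0) {
            printf("stride=%d s[0]=%f\n", stride, s[0]);
        }
    }

    if (tid == 0) {
        *out = s[0];
    }
}

// Hypothetical host helper: runs the kernel on N floats and returns the sum.
float run_reduce_debug(const float* host_in)
{
    float* d_in = nullptr;
    float* d_out = nullptr;
    float result = 0.0f;

    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host_in, N * sizeof(float), cudaMemcpyHostToDevice);

    reduce_debug<<<1, N>>>(d_in, d_out, N);

    // Synchronizing reports kernel errors and flushes device-side printf()
    // output to the host.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    }

    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}

int main()
{
    float host_in[N];
    for (int i = 0; i < N; ++i) {
        host_in[i] = static_cast<float>(i + 1);  // 1, 2, ..., 8
    }
    printf("sum = %f\n", run_reduce_debug(host_in));
    return 0;
}
```

If you later find you really do need more output than the buffer holds, you can enlarge it before the first kernel launch with cudaDeviceSetLimit(cudaLimitPrintfFifoSize, size_in_bytes), but I would start with the small log first.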