I have a kernel that uses a block size of 32 so that I don't need __syncthreads(), because padding is quite complicated in this case. But here is my surprise:
for all threads where threadIdx.x == 0, everything works fine. But for all threads with threadIdx.x == 1 … 31, I get incorrect results. When I insert __syncthreads() in the places where I would need it with a different block size, the results of those threads are correct!
Has anybody else observed this behaviour?
I will try to reproduce it with a small example and file a bug if I manage to do so.
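
To make the symptom concrete, here is a hypothetical sketch of the kind of pattern described above. It is not the actual kernel: the kernel name inclusiveSum, the helper countMismatches, the useSync flag and the input data are all assumptions made for illustration. Each thread writes one value to shared memory and then sums entries 0..threadIdx.x, so thread 0 depends only on its own write while threads 1 to 31 depend on writes made by other threads. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kBlockSize = 32;  // block size from the post (one warp)

// Hypothetical kernel, not the original one.
// Thread t stores one value in shared memory, then sums entries 0..t.
// Thread 0 reads only its own entry; every other thread reads entries
// written by other threads of the same block.
__global__ void inclusiveSum(const int *in, int *out, bool useSync)
{
    __shared__ int s[kBlockSize];
    const int tid = threadIdx.x;

    s[tid] = in[blockIdx.x * kBlockSize + tid];

    // useSync has the same value for every thread in the block,
    // so either all threads reach the barrier or none do.
    if (useSync)
        __syncthreads();

    int sum = 0;
    for (int i = 0; i <= tid; ++i)
        sum += s[i];

    out[blockIdx.x * kBlockSize + tid] = sum;
}

// Runs the kernel once and returns how many output elements differ
// from a reference computed on the CPU.
static int countMismatches(bool useSync)
{
    const int blocks = 4;
    const int n = blocks * kBlockSize;
    int hIn[n];
    int hOut[n] = {0};
    for (int i = 0; i < n; ++i)
        hIn[i] = i + 1;

    int *dIn = nullptr;
    int *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMemcpy(dIn, hIn, n * sizeof(int), cudaMemcpyHostToDevice);

    inclusiveSum<<<blocks, kBlockSize>>>(dIn, dOut, useSync);

    // This copy waits for the kernel to finish before reading the results.
    cudaMemcpy(hOut, dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    int bad = 0;
    for (int b = 0; b < blocks; ++b) {
        int ref = 0;  // running inclusive sum within one block
        for (int t = 0; t < kBlockSize; ++t) {
            ref += hIn[b * kBlockSize + t];
            if (hOut[b * kBlockSize + t] != ref)
                ++bad;
        }
    }
    return bad;
}

int main()
{
    printf("without barrier:      %d mismatches\n", countMismatches(false));
    printf("with __syncthreads(): %d mismatches\n", countMismatches(true));
    return 0;
}
```

Whether the variant without the barrier actually produces mismatches depends on the GPU and the compiler, so this only shows the shape of a test, not a guaranteed reproduction. On CUDA 9 and later, __syncwarp() is a warp-level barrier that could be tried in place of __syncthreads() for a single-warp block.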