I think the race condition is due to adjacent threads reading and writing the same elements of X. (I would paste the problematic lines, but the code was included as a picture.)
In one line, X[idx] is read, and in a later line X[j] is written, where j = idx + 1. This turns out to be OK, except at the warp boundaries. There, thread 32 may read X[idx] either before or after thread 31 has modified it.
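Since the original code is only available as a picture, here is a minimal sketch of the pattern described above. The kernel names, the float element type, the "+ 1" update and the single-block launch are all assumptions for illustration, not the poster's code. The first kernel shows the racy read/write; the second reads into a register, waits at a block-wide barrier, and only then writes, which should be safe as long as everything runs in one block.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical racy pattern: thread idx reads X[idx] while thread idx-1
// may be writing that same element as its X[j].
__global__ void shift_racy(float *X, int n) {
    int idx = threadIdx.x;
    int j = idx + 1;
    if (j < n) {
        float v = X[idx];      // may observe the old or the new value
        X[j] = v + 1.0f;
    }
}

// Same update, with all reads separated from all writes by a barrier.
__global__ void shift_synced(float *X, int n) {
    int idx = threadIdx.x;
    int j = idx + 1;
    float v = 0.0f;
    if (idx < n) v = X[idx];   // phase 1: every thread reads into a register
    __syncthreads();           // kept outside the branches so all threads reach it
    if (j < n) X[j] = v + 1.0f; // phase 2: every thread writes
}

// Hypothetical host helper: runs shift_synced on ONE block and checks the result.
bool shift_synced_is_correct(int n) {
    assert(n > 0 && n <= 1024);  // one block only, so at most 1024 threads
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = static_cast<float>(i);

    size_t bytes = n * sizeof(float);
    float *d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    shift_synced<<<1, n>>>(d, n);

    // This blocking copy also waits for the kernel to finish.
    bool ok = cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d);

    // X[0] is untouched; X[j] should be old X[j-1] + 1, which equals j here.
    for (int i = 0; ok && i < n; ++i) ok = (h[i] == static_cast<float>(i));
    return ok;
}
```

Calling shift_synced_is_correct(256) from main() should return true on a working device. Note that the barrier only covers one block, so this sketch does not help if idx spans several blocks.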
Your guess is right. __syncthreads() synchronizes threads only within a block, so I tried something like __threadfence(), which applies to both global and shared memory operations.
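Worth noting here: __threadfence() is a memory fence, not a barrier. It orders the calling thread's own memory operations as seen by other threads, but it does not make any thread wait for another, so on its own it would not be expected to remove the race. One possible alternative, sketched below under the assumption that a second buffer is affordable, is to read from one array and write to another so that the input is never modified during the kernel. The kernel name, the helper name, the float type and the "+ 1" update are hypothetical.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Out-of-place version: `in` is read-only, so no thread can observe
// an element that a neighbouring thread (in any block) has already updated.
__global__ void shift_out_of_place(const float *in, float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    // Element 0 has no left neighbour and is copied through unchanged.
    out[idx] = (idx == 0) ? in[0] : in[idx - 1] + 1.0f;
}

// Hypothetical host helper: launches over several blocks and checks the result.
bool shift_out_of_place_is_correct(int n) {
    assert(n > 0);
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    shift_out_of_place<<<blocks, threads>>>(d_in, d_out, n);

    // This blocking copy also waits for the kernel to finish.
    bool ok = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_in);
    cudaFree(d_out);

    // With in[i] = i, out[i] should be in[i-1] + 1 = i, and out[0] = 0.
    for (int i = 0; ok && i < n; ++i) ok = (h_out[i] == static_cast<float>(i));
    return ok;
}
```

If the update has to be repeated, the two buffers can be swapped between kernel launches; the launch boundary then acts as the grid-wide synchronization point.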
Yeah, you are right, the problem is with the X and Y arrays. I need to modify their elements with thread j and then read them with thread idx the next time, so I used __threadfence() to try to synchronize, but nothing happened :(. Do you have any other solution? Oh, and here is the code as text :)