I don’t know what you expect to happen (it would be useful if you posted the CPU code as well), or what you were hoping to achieve with the [font=“Courier New”]__threadfence()[/font].
However, if [font=“Courier New”]WIDTH[/font] is the blocksize you are calling the kernel with, you have out-of-bounds accesses in shared memory. You would need to declare C and S large enough for the highest index actually used (e.g. [font=“Courier New”]WIDTH + 1[/font] elements).
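For illustration only (the actual declarations weren’t quoted, so the element type and the exact bound are assumptions based on the [font=“Courier New”]idx + 1[/font] access pattern discussed below), the fix could look like:

```cuda
// Assumption: the kernel is launched with blockDim.x == WIDTH and the
// neighbour access touches index threadIdx.x + 1, so each array needs
// one extra element to keep that access in bounds.
__shared__ float C[WIDTH + 1];
__shared__ float S[WIDTH + 1];
```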
As I understand it, the threadfence instruction only affects active threads, so there is no harm in using threadfence where there might be warp divergence.
I think the race condition is due to adjacent threads reading and writing the same elements of X. (I would paste the problematic lines, but the code was included as a picture.)
In one line, X[idx] is read, and in a later line X[j] is written, with j = idx + 1. This turns out to be OK within a warp, but at the warp boundaries there is a possibility of thread 32 reading X[idx] before or after thread 31 has modified it.
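A minimal sketch of that pattern and its fix (the kernel name and surrounding logic are assumptions, since the original code was posted as an image):

```cuda
// Racy: thread idx reads X[idx] and writes X[idx + 1]. Within a warp the
// read happens before the neighbouring write, but thread 32 sits in a
// different warp than thread 31, so it may read X[32] before or after
// thread 31 has written it.
__global__ void shift_racy(float *X, int n)
{
    int idx = threadIdx.x;
    float v = X[idx];
    if (idx + 1 < n)
        X[idx + 1] = v;
}

// Safe: a block-wide barrier between the read and the write guarantees
// every thread has finished reading before any thread writes.
__global__ void shift_safe(float *X, int n)
{
    int idx = threadIdx.x;
    float v = X[idx];    // all reads happen first
    __syncthreads();     // barrier, placed outside any divergent branch
    if (idx + 1 < n)
        X[idx + 1] = v;  // then all writes
}
```

Note that [font=“Courier New”]__threadfence()[/font] cannot fix this: it only orders the calling thread’s own memory operations as seen by other threads, it does not make any thread wait for another.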
Your guess is right: __syncthreads() synchronizes threads for read/write ops in shared memory only, and only within a block, so I tried to use something like __threadfence(), since it can be used with global memory ops and shared memory too.
Yeah, you are right, the problem is with the X & Y arrays. I should modify their elements with thread j and then read them with thread idx the next time, so I used __threadfence() to try to synchronize, but nothing happened :( Do you have any other solution? Ohhh, and here is the code in text :)