I have a question about whether I should be using any synchronization instructions (__syncthreads() or __threadfence()) in the following case.
I’m dealing with a stream compaction problem, and the work is organized in such a way that one warp processes a whole image row (3072 pixels), so each warp iterates several times to process a full row. The output is written to a corresponding row in global memory (not shared), and only that same warp can write to this row.
Let’s say that at iteration i a given warp writes directly to output position 123, that is, out[123]. In the next iteration, i+1, it might need to overwrite the value at out[123].
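To make the access pattern concrete, here is a minimal sketch of the kind of loop I mean. It is a simplified illustration, not my actual code: the names compactRows and countKeptPerRow, the "keep non-zero pixels" predicate, and the assumption that the row width is a multiple of 32 are all made up for this example. Note that the lane that overwrites a slot in iteration i+1 may differ from the lane that wrote it in iteration i, and there is deliberately no __syncthreads() or __threadfence() between the two writes.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int WARP = 32;
constexpr unsigned FULL_MASK = 0xffffffffu;

// Hypothetical kernel: one warp compacts one row, 32 pixels per iteration.
// Assumes blockDim.x and width are multiples of 32.
__global__ void compactRows(const int* in, int* out, int* counts,
                            int rows, int width)
{
    const int lane = threadIdx.x % WARP;
    const int row  = (blockIdx.x * blockDim.x + threadIdx.x) / WARP;
    if (row >= rows) return;  // all lanes of a warp share the same row

    const int* src = in  + (size_t)row * width;
    int*       dst = out + (size_t)row * width;
    int written = 0;          // kept pixels so far (same value in every lane)

    for (int base = 0; base < width; base += WARP) {
        const int  v    = src[base + lane];
        const bool keep = (v != 0);  // hypothetical predicate

        const unsigned ballot = __ballot_sync(FULL_MASK, keep);
        const int kept = __popc(ballot);
        const int rank = __popc(ballot & ((1u << lane) - 1u)); // kept lanes before me

        // Kept lanes fill the front of a 32-slot window; the remaining lanes
        // write a placeholder (0) into the slots behind them.
        const int slot = keep ? rank : kept + (lane - rank);
        dst[written + slot] = keep ? v : 0;

        // The next window starts right after the kept values, so the
        // placeholder slots written above get overwritten in iteration i+1.
        // No synchronization instruction here: this is what I am asking about.
        written += kept;
    }
    if (lane == 0) counts[row] = written;
}

// Hypothetical host helper: runs the kernel and returns the number of kept
// pixels per row. Error checking is omitted to keep the sketch short.
std::vector<int> countKeptPerRow(const std::vector<int>& pixels,
                                 int rows, int width)
{
    const size_t bytes = pixels.size() * sizeof(int);
    int *dIn = nullptr, *dOut = nullptr, *dCounts = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMalloc(&dCounts, rows * sizeof(int));
    cudaMemcpy(dIn, pixels.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 128;  // 4 warps per block
    const int blocks  = (rows * WARP + threads - 1) / threads;
    compactRows<<<blocks, threads>>>(dIn, dOut, dCounts, rows, width);

    std::vector<int> counts(rows);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(counts.data(), dCounts, rows * sizeof(int),
               cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    cudaFree(dCounts);
    return counts;
}
```

Calling countKeptPerRow(pixels, rows, 3072) with a rows x 3072 image returns one count per row. The counts come only from the ballot, so they do not depend on the answer to my question; what I am unsure about is the final content of out.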
My question is whether it is safe to update the out value this way without using any synchronization instructions.
Thanks in advance,