CUDA synchronization question

Hello all!

I have a question about whether I should be using any synchronization instructions (__syncthreads() or __threadfence()) in the following case.

I’m dealing with a stream compaction problem, and the work is organized in such a way that one warp processes a whole image row (3072 pixels), so a warp iterates several times to process a full row. The output is written into another line in global memory (not shared), and only the same warp can write into this line.

Let's say that at iteration i, a given warp writes directly into output position 123, that is, out[123]. In the next iteration, i+1, I might need to rewrite the value at out[123].

My question is whether it is safe to update the out[123] value this way without using any of the synchronization instructions.
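To make the pattern concrete, here is a minimal sketch of the situation, not the actual compaction code. The names rewriteKernel and runRewrite, the launch configuration, and the rule for which lane writes are all hypothetical. If the same thread does both writes, its own program order already applies. The __syncwarp() call is only there under the assumption that a different lane of the warp may do the rewrite in the next iteration, which matters on GPUs with independent thread scheduling (Volta and later).

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: a single warp (32 threads) owns the output row.
// In iteration i one lane writes out[pos]; in iteration i+1 a possibly
// different lane overwrites the same element.
__global__ void rewriteKernel(int *out, int pos, int iterations)
{
    const int lane = threadIdx.x % warpSize;
    for (int i = 0; i < iterations; ++i) {
        // Assumption for illustration: the writing lane changes every iteration.
        if (lane == i % warpSize) {
            out[pos] = i;
        }
        // Warp-level barrier with memory ordering among the participating
        // lanes, so this iteration's write is ordered before the next one.
        __syncwarp();
    }
}

// Hypothetical host helper: runs the kernel with one warp and returns out[123].
int runRewrite(int iterations)
{
    const int n = 256, pos = 123;
    int *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemset(d_out, 0, n * sizeof(int));
    rewriteKernel<<<1, 32>>>(d_out, pos, iterations);
    int result = -1;
    // A blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(&result, d_out + pos, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return result;
}

int main()
{
    const int last = runRewrite(8);
    assert(last == 7);  // the write from the final iteration should survive
    printf("out[123] = %d\n", last);
    return 0;
}
```

Neither __syncthreads() nor __threadfence() appears here, because in this sketch only one warp ever touches the row.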

Thanks in advance,


If you intend to update it as below:

x = out[123]; // 1.read
x = x + 1; // 2.modify
out[123] = x; // 3.write

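then no other thread may read or write out[123] while steps 1, 2, and 3 are in progress.
If that cannot be guaranteed, consider using an atomic operation such as atomicCAS (or atomicAdd for a plain increment).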
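As a rough illustration of the atomicCAS approach, the read-modify-write above can be wrapped in a compare-and-swap loop. This is a sketch under assumptions, not code from the original problem: atomicIncrementCAS, incrementKernel and runIncrement are hypothetical names, and the launch sizes are arbitrary.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: performs x = x + 1 on *addr atomically by retrying
// until no other thread changed the value between the read and the write.
__device__ int atomicIncrementCAS(int *addr)
{
    int old = *addr;                                  // 1. read
    int assumed;
    do {
        assumed = old;
        // 2. modify and 3. write, but only if *addr still equals 'assumed'.
        old = atomicCAS(addr, assumed, assumed + 1);
    } while (old != assumed);                         // another thread won, retry
    return old;
}

// Every launched thread increments the same element.
__global__ void incrementKernel(int *out, int idx)
{
    atomicIncrementCAS(&out[idx]);
}

// Hypothetical host helper: returns out[123] after all threads have run.
int runIncrement(int blocks, int threadsPerBlock)
{
    const int n = 256, idx = 123;
    int *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemset(d_out, 0, n * sizeof(int));
    incrementKernel<<<blocks, threadsPerBlock>>>(d_out, idx);
    int result = -1;
    // A blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(&result, d_out + idx, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return result;
}

int main()
{
    const int total = runIncrement(4, 64);
    assert(total == 4 * 64);  // no increment should be lost
    printf("out[123] = %d\n", total);
    return 0;
}
```

For a simple increment like this, atomicAdd(&out[idx], 1) does the same job in one call. The CAS loop is mainly useful when the modify step is something the built-in atomics do not cover.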