I am reading the book “Programming Massively Parallel Processors” and noticed the below code snippets to achieve “domino-style” scan:
if (threadIdx.x == 0) {
while(AtomicAdd(&flags[bid], 0) == 0) {}
// ???? why do I not need thread fence here (for acquire semantic) to prevent load-load reordering between the loads for flags[bid] and scan_value[bid]?
previous_sum = scan_value[bid];
scan_value[bid+1] = previous_sum + local_sum;
__threadfence(); // why the "release" fence here is sufficient?
atomicAdd(&flags[bid + 1], 1);
}
According to the book , the __threadfence()
is required between the two writes to ensure scacn_value[bid+1]
is written to global memory before flags[bid+1]
is incremented, which seems to resemble the release semantic in C++ memory model (except that __threadfence has seq_cst semantics).
However, I am unable to understand why a similar __threadfence
(or acquire fence) is not needed between the two reads to prevent load-load reordering (i.e. scan_value[bid]
being read before flags[bid]
)? Thanks!