CUDA memory model & threadfence

I am reading the book “Programming Massively Parallel Processors” and noticed the code snippet below, which implements a “domino-style” scan:

if (threadIdx.x == 0) {
    while (atomicAdd(&flags[bid], 0) == 0) {}
    // ???? why do I not need a thread fence here (for acquire semantics) to prevent
    // load-load reordering between the loads of flags[bid] and scan_value[bid]?
    previous_sum = scan_value[bid];
    scan_value[bid + 1] = previous_sum + local_sum;
    __threadfence(); // why is the "release" fence here sufficient?
    atomicAdd(&flags[bid + 1], 1);
}

According to the book, the __threadfence() is required between the two writes to ensure scan_value[bid+1] is written to global memory before flags[bid+1] is incremented. This resembles release semantics in the C++ memory model (except that __threadfence is stronger, closer to seq_cst semantics).

However, I am unable to understand why a similar __threadfence (or an acquire fence) is not needed between the two reads to prevent load-load reordering, i.e. scan_value[bid] being read before flags[bid] is observed to be non-zero. Thanks!

