Questions about the CUDA memory model: does causality order remain transitive across different scopes?

Hi, we recently encountered some CUDA memory model related issues while doing cross-CTA communication.

We are seeking an authoritative answer from CUDA memory model experts.

Specifically, we want to know whether causality order remains transitive across different scopes.

The specific case is as follows:

__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;
__global__ void sum(const float* array, unsigned int N,
                    float* result)
{
    // Each block sums a subset of the input array.
    float partialSum = calculatePartialSum(array, N); 

    if (threadIdx.x == 0) {

        // Thread 0 of each block stores the partial sum
        // to global memory.
        result[blockIdx.x] = partialSum;

        // Thread 0 makes sure that the incrementation
        // of the "count" variable is only performed after
        // the partial sum has been written to global memory.
        __threadfence();

        // Thread 0 signals that it is done.
        unsigned int value = atomicInc(&count, gridDim.x);

        // Thread 0 determines if its block is the last
        // block to be done.
        isLastBlockDone = (value == (gridDim.x - 1));
    }   

    // Synchronize to make sure that each thread reads
    // the correct value of isLastBlockDone.
    __syncthreads();

    if (isLastBlockDone) {

        // The last block sums the partial sums
        // stored in result[0 .. gridDim.x-1]
        float totalSum = calculateTotalSum(result);

        if (threadIdx.x == 0) {

            // Thread 0 of last block stores the total sum
            // to global memory and resets the count
            // variable, so that the next kernel call
            // works properly.
            result[0] = totalSum;
            count = 0;
        }   
    }   
}

In the case above, the result array is not declared volatile, so its elements may be cached incoherently in L1.

So we want to know: according to the CUDA memory model, when the last block executes calculateTotalSum(result), will it safely read the partial sums written by the other CTAs?

We suspect this has something to do with the transitivity of causality order across different scopes:

With the __threadfence() and the atomicInc() on count, causality order is established at gpu scope between the other blocks' writes to the result array and the last block's write to the isLastBlockDone flag:

write_result_array → write_isLastBlockDone.
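
For concreteness, here is our reading of that first edge written with an explicit scoped fence from libcu++ (a sketch on our part, not code from the guide). The guide documents __threadfence() as equivalent to a seq_cst fence at device scope, and the legacy atomic functions as relaxed RMWs, so the fence is what supplies the release:

#include <cuda/atomic>  // libcu++ scoped atomics and fences (CUDA 11+)

// Sketch of edge 1 (our reading, inside thread 0 of each block):
result[blockIdx.x] = partialSum;                      // plain global store
cuda::atomic_thread_fence(cuda::memory_order_release, // release half of the
                          cuda::thread_scope_device); // seq_cst device-scope
                                                      // fence __threadfence()
                                                      // is documented to be
unsigned int value = atomicInc(&count, gridDim.x);    // relaxed RMW, gpu scope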

With __syncthreads(), causality order is established at cta scope between the write to isLastBlockDone and the read of the result array in calculateTotalSum():

write_isLastBlockDone → read_result_array.
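
Spelled out, this second edge is just a plain store ordered by the barrier; the following is only our annotation of the unchanged code, relying on the guide's documented semantics of __syncthreads():

// Thread 0 only: a plain, non-atomic store to shared memory.
isLastBlockDone = (value == (gridDim.x - 1));

// Execution barrier; the guide documents that it also makes each thread's
// prior global and shared memory accesses visible to all threads of the block.
__syncthreads();

// Every thread of the CTA now reads the flag that thread 0 wrote.
if (isLastBlockDone) { /* ... */ }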

Does causality order remain transitive across different scopes according to the CUDA memory model?

For example, in our case, does the following causality order hold?

write_result_array → write_isLastBlockDone → read_result_array.
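
To make the scopes explicit end to end, here is the whole pattern rewritten with libcu++ scoped atomics. This is a sketch under our assumptions, not an authoritative answer; calculatePartialSum/calculateTotalSum are assumed as in the snippet above, and fetch_add stands in for atomicInc's wrapping variant:

#include <cuda/atomic>

// Device-scope atomic counter; the constexpr initializer zeroes it at load.
__device__ cuda::atomic<unsigned int, cuda::thread_scope_device> count{0};
__shared__ bool isLastBlockDone;

__global__ void sum(const float* array, unsigned int N, float* result)
{
    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {
        result[blockIdx.x] = partialSum;

        // Edge 1, gpu scope: release orders the result[] store before the
        // signal; acquire lets the last-arriving thread 0 observe the
        // earlier releases from the other blocks.
        unsigned int value = count.fetch_add(1u, cuda::memory_order_acq_rel);

        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Edge 2, cta scope: execution barrier plus block-level ordering.
    __syncthreads();

    if (isLastBlockDone) {
        // Whether every thread here may read result[] written by other CTAs
        // is exactly the cross-scope transitivity being asked about: only
        // thread 0 performed the device-scope acquire.
        float totalSum = calculateTotalSum(result);

        if (threadIdx.x == 0) {
            result[0] = totalSum;
            count.store(0u, cuda::memory_order_relaxed); // reset for next call
        }
    }
}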

Also see here.

Haha, that Stack Overflow question was also posted by me.
I hope to get an authoritative answer from NVIDIA experts.

Yes, I didn’t add my comment for your benefit. The postings are basically identical. I was not wondering if it was your post.

I was commenting for the benefit of others. If you have cross-posted a question, and I was studying it, I would want to know what comments had been made in other places that you posted it.

Someone pointed me to this CUDA sample: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions. A similar code snippet appears near section "7.6. Synchronization Functions". The only difference is the volatile qualifier on the result global buffer. I want to know whether the volatile qualifier is optional or required.
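
For reference, the guide's version of the snippet differs from the one above only in the kernel signature:

// Parameter declaration as in the programming guide's memory-fence sample:
__global__ void sum(const float* array, unsigned int N,
                    volatile float* result);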