CUDA global reads/writes in a cooperative kernel

Hello,

Suppose we create a grid group named grid inside a cooperative kernel. We can call grid.sync() to synchronize the group. I am curious whether this also guarantees that every thread sees the writes that other threads made to global memory before the barrier. In other words, is global memory in a consistent state across all threads after grid.sync()?

Consider the following code snippet:

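#include <cstdio>
#include <cooperative_groups.h>

__global__ void my_coop_kernel(int *nums) {
    auto grid = cooperative_groups::this_grid();
    // Every thread writes its own grid-wide rank.
    nums[grid.thread_rank()] = static_cast<int>(grid.thread_rank());
    grid.sync();  // grid-wide barrier; requires a cooperative launch
    if (grid.thread_rank() == 0) {
        for (unsigned long long i = 0; i < grid.size(); ++i)
            printf("%d\n", nums[i]); // Is this going to be correct?
    }
}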

Best Regards,

Draft

Yes, it will print the correct numbers. If you look at the generated PTX (for example in Compiler Explorer), it shows a global memory barrier (membar.gl) between writing the values and reading them back.

The barrier disappears if grid.sync() is removed.
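
If you want to check this empirically rather than by reading PTX, here is a minimal sketch. The kernel write_sync_read and the host helper count_stale_reads are hypothetical names made up for this example, not part of any library. Each thread writes its rank, waits at grid.sync(), and then reads the slot written by the thread at the opposite end of the grid (normally in a different block). The host counts reads that did not see the expected value. It assumes the device supports cooperative launch and that all blocks fit on the GPU at once; otherwise the helper just returns -1.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

// Hypothetical test kernel: write own rank, barrier, read another thread's slot.
__global__ void write_sync_read(int *nums, int *seen) {
    cg::grid_group grid = cg::this_grid();
    const unsigned long long rank = grid.thread_rank();
    const unsigned long long n = grid.size();
    nums[rank] = static_cast<int>(rank);
    grid.sync();  // grid-wide barrier between the writes and the reads
    // Read the value written by the "mirror" thread at the other end of the grid.
    seen[rank] = nums[n - 1 - rank];
}

// Hypothetical helper: returns the number of stale reads, or -1 if the
// device or launch configuration cannot run the kernel cooperatively.
int count_stale_reads(int blocks, int threads_per_block) {
    int device = 0, supported = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return -1;
    if (cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, device)
            != cudaSuccess || !supported) return -1;

    const size_t n = static_cast<size_t>(blocks) * threads_per_block;
    int *nums = nullptr, *seen = nullptr;
    if (cudaMalloc(&nums, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&seen, n * sizeof(int)) != cudaSuccess) {
        cudaFree(nums);
        return -1;
    }

    // grid.sync() is only valid when the kernel is launched cooperatively.
    void *args[] = { &nums, &seen };
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void *)write_sync_read, dim3(blocks), dim3(threads_per_block),
        args, 0, 0);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    int stale = -1;
    if (err == cudaSuccess) {
        std::vector<int> host(n);
        err = cudaMemcpy(host.data(), seen, n * sizeof(int),
                         cudaMemcpyDeviceToHost);
        if (err == cudaSuccess) {
            stale = 0;
            for (size_t i = 0; i < n; ++i)
                if (host[i] != static_cast<int>(n - 1 - i)) ++stale;
        }
    }
    cudaFree(nums);
    cudaFree(seen);
    return stale;
}
```

Calling count_stale_reads(2, 64) from host code should return 0 if the writes are visible after the barrier, as the PTX suggests. A clean run is only supporting evidence, not a proof of the memory-ordering guarantee.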