CUDA global memory reads/writes in a cooperative kernel


Suppose we create a grid group grid inside a cooperative kernel. We can call grid.sync() to synchronize the whole group. I am curious whether this also guarantees that every thread sees the writes other threads made to global memory before the barrier. In other words, is global memory in a consistent state across all threads after grid.sync()?

Consider the following code snippet:

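#include <cstdio>
#include <cooperative_groups.h>

__global__ void my_coop_kernel(int *nums) {
    auto grid = cooperative_groups::this_grid();
    nums[grid.thread_rank()] = grid.thread_rank(); // every thread writes its own rank
    grid.sync(); // grid-wide barrier between the writes and the reads
    if (grid.thread_rank() == 0) {
        for (auto i = 0; i < grid.size(); ++i)
            printf("%d\n", nums[i]); // Is this going to be correct?
    }
}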

Best Regards,


Yes, it will print the correct numbers. If you take a look at the PTX code,
it shows a global memory barrier between writing the values and reading them back.

The barrier disappears if grid.sync() is removed.
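
One thing worth adding: grid.sync() is only valid when the kernel is started with a cooperative launch, so the host side matters too. Below is a minimal sketch of how the kernel from the question could be launched with cudaLaunchCooperativeKernel. The block and thread counts (2 and 64) and the name d_nums are assumptions chosen for illustration, and error handling is kept to a minimum. Depending on the toolkit version you may also need to compile with -rdc=true and target a device that supports cooperative launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

__global__ void my_coop_kernel(int *nums) {
    cg::grid_group grid = cg::this_grid();
    // Every thread writes its own rank to global memory.
    nums[grid.thread_rank()] = static_cast<int>(grid.thread_rank());
    // Grid-wide barrier: all writes above complete before any thread continues.
    grid.sync();
    if (grid.thread_rank() == 0) {
        for (unsigned long long i = 0; i < grid.size(); ++i)
            printf("%d\n", nums[i]);
    }
}

int main() {
    int device = 0;
    int supports_coop = 0;
    // Check that the device can run cooperative launches at all.
    cudaDeviceGetAttribute(&supports_coop, cudaDevAttrCooperativeLaunch, device);
    if (!supports_coop) {
        fprintf(stderr, "cooperative launch is not supported on this device\n");
        return 1;
    }

    // Illustrative sizes (assumption): small enough that all blocks
    // can be resident on the device at the same time.
    const int blocks = 2;
    const int threads = 64;

    int *d_nums = nullptr;
    cudaMalloc(&d_nums, blocks * threads * sizeof(int));

    // Kernel arguments are passed as an array of pointers to each argument.
    void *args[] = { &d_nums };
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void *)my_coop_kernel, dim3(blocks), dim3(threads), args, 0, nullptr);
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_nums);
        return 1;
    }

    cudaDeviceSynchronize();
    cudaFree(d_nums);
    return 0;
}
```

If the kernel is launched with the ordinary <<<...>>> syntax instead, the behaviour of grid.sync() is not defined, so the guarantee discussed above should not be relied on in that case.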