Cooperative groups grid sync + global write issue

I have a cooperative groups launch, and somewhere in the kernel I do the following:

  • Write to global memory from threads
  • Grid sync
  • Read value from global memory from different threads
  • Do something with value
  • Printf value

The printed value is wrong. If I instead do:

  • Write to global memory from threads
  • Grid sync
  • Read value from global memory from different threads
  • Printf value
  • Do something with value
  • Printf value

The printed value is correct.

I don’t understand how this can be the case, as the grid sync sits between the write and the read. Is there something I might be missing? Does a grid sync only guarantee synchronization of threads and not visibility of global memory writes?

Also, I get the grid group with “cg::grid_group grid = cg::this_grid();” and sync with “cg::sync(grid);”.

Be sure you have met all the requirements for a proper cooperative grid launch as indicated in the programming guide:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#grid-synchronization-cg
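
For reference, here is a minimal sketch of what checking those requirements can look like on the host side. It assumes device 0 and a hypothetical kernel named probe_kernel (neither comes from the original post), and it only covers the points I am sure of: the device must report cooperative launch support, the grid must be small enough for all blocks to be resident at once, and the kernel must be started with cudaLaunchCooperativeKernel rather than the <<<...>>> syntax. Older toolkits may also need relocatable device code (-rdc=true) when compiling; treat that as something to verify for your CUDA version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical probe kernel: records whether the grid group supports grid-wide sync.
__global__ void probe_kernel(int *valid)
{
    cg::grid_group grid = cg::this_grid();
    bool ok = grid.is_valid();   // false if the launch was not cooperative
    if (blockIdx.x == 0 && threadIdx.x == 0) *valid = ok ? 1 : 0;
    if (ok) cg::sync(grid);      // only sync when the grid group is valid
}

// Returns 1 if a cooperative launch worked, 0 if it is unavailable
// (no device or no support), and -1 on an unexpected CUDA error.
int try_cooperative_launch(int threads_per_block)
{
    int count = 0, supported = 0, num_sms = 0, per_sm = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return 0;
    if (cudaSetDevice(0) != cudaSuccess) return -1;

    // Requirement: the device must support cooperative launches.
    cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, 0);
    if (!supported) return 0;

    // Requirement: every block must be resident at the same time, so size
    // the grid from the occupancy calculator instead of the problem size.
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&per_sm, probe_kernel,
                                                  threads_per_block, 0);
    int num_blocks = per_sm * num_sms;
    if (num_blocks < 1) return -1;

    int *d_valid = nullptr;
    if (cudaMalloc(&d_valid, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_valid, 0, sizeof(int));

    // Requirement: launch through the cooperative launch API.
    void *args[] = { &d_valid };
    cudaError_t err = cudaLaunchCooperativeKernel((void *)probe_kernel,
                                                  dim3(num_blocks),
                                                  dim3(threads_per_block),
                                                  args, 0, 0);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    int h_valid = 0;
    if (err == cudaSuccess)
        err = cudaMemcpy(&h_valid, d_valid, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_valid);
    if (err != cudaSuccess) return -1;
    return h_valid ? 1 : -1;
}

int main()
{
    int result = try_cooperative_launch(128);
    printf("cooperative launch check returned %d\n", result);
    return result < 0 ? 1 : 0;
}
```

If the launch call itself returns an error, or the grid group is not valid inside the kernel, the grid sync cannot be relied on, so that is worth ruling out before looking at memory visibility.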

Oh, don’t know why I didn’t consider that, thanks!

So I had something like:

data[id] = value;

cg::sync(grid);

value = data[id];

I made a pointer to data like “volatile float* vdata = data;” and peppered in __threadfence() calls, and I’m still getting versions of the same issue. Is there something extra I might need?

Looking into this more, why does the reductionMultiBlockCG example not require a volatile or threadfence for the global memory reduction at the end?

As far as I know, it is not documented anywhere that a grid sync provides a memory barrier. However, my previous statement that it does not provide one may be incorrect (and the reductionMultiBlockCG code would seem to suggest that it is).

So I edited my previous statement.

I’m not able to speculate why your code is not working the way you would like.

If you provide a complete, testable code example, perhaps someone will be able to help you. Note that this doesn’t have to be, and probably shouldn’t be, your whole code. Instead, create a standalone, complete, compilable example that demonstrates the issue but has extraneous items removed.
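
As a starting point, a reduced test case for the write / grid sync / read pattern might look something like the sketch below. Everything in it is an assumption made for illustration: the kernel name write_sync_read, the out buffer, the choice to have each thread read its neighbor's slot, the cap of 64 blocks, and the use of device 0. It is not the original poster's code. The host side counts how many threads read a value other than the one their neighbor wrote.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: the grid is launched with exactly n threads.
__global__ void write_sync_read(float *data, float *out, unsigned int n)
{
    cg::grid_group grid = cg::this_grid();
    unsigned int id = blockIdx.x * blockDim.x + threadIdx.x;

    data[id] = (float)id;            // write to global memory
    cg::sync(grid);                  // grid sync, reached by every thread
    out[id] = data[(id + 1) % n];    // read a slot written by another thread
}

// Returns the number of mismatching reads, -1 if cooperative launch is
// unavailable (no device or no support), and -2 on a CUDA error.
int run_write_sync_read(int threads_per_block)
{
    int count = 0, supported = 0, num_sms = 0, per_sm = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return -1;
    if (cudaSetDevice(0) != cudaSuccess) return -2;
    cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, 0);
    if (!supported) return -1;

    // Keep the grid within what can be resident at once (capped at 64 blocks).
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&per_sm, write_sync_read,
                                                  threads_per_block, 0);
    int num_blocks = per_sm * num_sms;
    if (num_blocks < 1) return -2;
    if (num_blocks > 64) num_blocks = 64;
    unsigned int n = (unsigned int)num_blocks * threads_per_block;

    float *d_data = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return -2;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_data);
        return -2;
    }

    void *args[] = { &d_data, &d_out, &n };
    cudaError_t err = cudaLaunchCooperativeKernel((void *)write_sync_read,
                                                  dim3(num_blocks),
                                                  dim3(threads_per_block),
                                                  args, 0, 0);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    std::vector<float> h_out(n);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                         cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_out);
    if (err != cudaSuccess) return -2;

    // Thread id should have read the value written by thread (id + 1) % n.
    int mismatches = 0;
    for (unsigned int id = 0; id < n; ++id)
        if (h_out[id] != (float)((id + 1) % n)) ++mismatches;
    return mismatches;
}

int main()
{
    int result = run_write_sync_read(128);
    printf("mismatches (negative means unavailable or error): %d\n", result);
    return result == 0 ? 0 : 1;
}
```

Checking the results on the host rather than with an in-kernel printf is a deliberate choice here, since the original symptom changed depending on where printf was placed. If a stripped-down program like this behaves correctly while the full kernel does not, the difference between the two is probably the place to look.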