Force flush to global memory on grid level in cooperative kernels

I was wondering if there is a function, similar to __syncthreads(), that ensures that all threads in a cooperative kernel "flush" their caches so that subsequent reads really come from global memory.

The programming guide does not mention any memory-visibility guarantees when using grid.sync().

Do __syncthreads() and a following grid.sync() already do this?

The reason for my question is that I have a program which runs in stages. During a stage, a set of variables is only heavily read from (so caching would be nice). At the end of each stage (after all threads have finished), these variables are updated and then do not change until the end of the next stage.

If I could force all threads in all active thread blocks to read the variables from global memory (at least once) after each stage, I could avoid using volatile.
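For concreteness, the staged pattern I have in mind looks roughly like this (a sketch; doStageWork and computeNextParam are hypothetical stand-ins for the real work, and the question is whether the reads in the next stage see the updated values or stale cached ones):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hypothetical helpers standing in for the real stage work.
__device__ void doStageWork(const float *params, float *data);
__device__ float computeNextParam(int stage, int i);

__global__ void stagedKernel(float *stage_params, float *data,
                             int num_params, int num_stages)
{
    cg::grid_group grid = cg::this_grid();

    for (int stage = 0; stage < num_stages; ++stage) {
        // During the stage: stage_params is only read (caching is welcome).
        doStageWork(stage_params, data);

        grid.sync();  // all threads in all blocks have finished the stage

        // A few designated threads update the variables for the next stage.
        if (grid.thread_rank() < (unsigned)num_params)
            stage_params[grid.thread_rank()] =
                computeNextParam(stage, grid.thread_rank());

        grid.sync();  // updates complete before the next stage starts reading
    }
}
```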

Cheers!

Various memory barriers and characteristics are discussed in the programming guide.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions

I’m not aware that any of these are abrogated in the presence of a cooperative kernel launch.

__syncthreads() only guarantees visibility within threads in a block, not between blocks.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-functions
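As a sketch of what this looks like in practice (a hypothetical example, assuming the usual pattern of pairing a device-scope fence with the grid-wide barrier):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void broadcastValue(int *data, int *out)
{
    cg::grid_group grid = cg::this_grid();

    // One thread in the whole grid produces a value.
    if (grid.thread_rank() == 0)
        data[0] = 42;

    __threadfence();   // make the write visible at device scope
    grid.sync();       // no thread in any block proceeds past this point
                       // until all threads have arrived

    // Every thread in every block now reads the updated value.
    out[grid.thread_rank()] = data[0];
}
```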

Note that you can have two different pointers referencing the same location in memory. If one of those pointers is marked volatile, then volatile load/store rules will be applied, when that pointer is used.
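A minimal illustration of that point (sketch):

```cuda
__device__ int g_value;

__device__ int aliasedLoads()
{
    int *p = &g_value;                            // ordinary pointer
    volatile int *vp = (volatile int *)&g_value;  // volatile pointer to the same address

    int a = *p;   // may be satisfied from a register or a previous load
    int b = *vp;  // volatile rules apply to THIS access: a real load is emitted
    return a + b;
}
```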

As I understand it, thread fences only help with memory read/write ordering as well as visibility. I do not know whether this visibility guarantee also includes a cache update/bypass.

This is brilliant, but I thought addresses get cached, not variables/pointers. E.g., if the same address is accessed by dereferencing two different pointers, the hardware can still make use of the cache, since the cache recognizes that the same address is used.

I can use a volatile pointer for an actual write to global memory. But even if I create a new pointer after this write and dereference it, it could still return a cached, outdated version of the memory at this address. The cache might still contain a value from the previous stage execution for this particular address. I don't think a volatile write would update the cache. Or am I mistaken?

“volatile” means: a data object so tagged can change at any time due to an external agent outside the scope of the present code. Classical examples include data objects representing hardware registers such as clocks, or data updated by interrupt service routines. The use of volatile does not represent a synchronization or memory barrier mechanism.

The volatile modifier does not interact with hardware caches, correct. All it does is force the compiler to generate an actual access to the memory object so tagged every time such an access would occur in the abstract machine defined by the C++ standard. In practical terms, this often prevents the carrying around of such data in registers for extended stretches, and by extension may inhibit the motion of loads in generated machine code.
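A classic illustration (sketch) of what that means for generated code:

```cuda
__device__ int g_flag;

__device__ void spinOnFlag()
{
    // Without volatile, the compiler may load g_flag once, keep the value
    // in a register, and spin forever even after another agent sets it:
    // while (g_flag == 0) { }

    // With volatile, each iteration performs an actual load of the address:
    volatile int *p = &g_flag;
    while (*p == 0) { }

    // Note: this still says nothing about ordering or hardware caches;
    // volatile is not a synchronization mechanism.
}
```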

Generally speaking, uses of volatile in C++ code should be very rare, and where they occur they may be necessary but not sufficient to solve the particular issue at hand. As far as I am concerned, historically the volatile modifier has been overused (if not to say, abused) in CUDA programming in an attempt to address issues for which volatile was not designed. I consider such idioms workarounds / hacks, and they should be considered brittle code.

I believe the C++ standard is going to deprecate some uses of volatile, for the reasons you mentioned in your last paragraph.

I only wanted to use volatile as a tool for inter-block communication. E.g., some blocks will write to volatile memory, some will read from it. I was worried that the caches might interfere with this, so volatile would avoid it. However, I've just read in the programming guide that the L1 cache (one per SM) is only used for read-only data, and that the L2 cache is shared by all streaming multiprocessors alike. So if the variable through which I access global memory is not const/read-only, I should not need to worry about caches interfering.