Any way to guarantee writes have made it to global memory?

I have a random question about trying to force some ordering of block execution. I know that global synchronization is not really supported, but for one particular application I am trying to port to the GPU, I want to try this. Here is the basic idea, simplified to explain what I want to do.

I want to launch a kernel with 3 blocks (A, B, and C, let's say). These blocks are large enough that only one will fit per MP (per the CUDA occupancy calculator). Let's say A and B each run for 2 seconds, and C runs for approximately 4 seconds. Now the problem is, some of C's inputs are dependent on A and B's outputs.

My idea was as follows: launch A, B, and C at the same time. C can do 2 seconds of work before it needs A and B's outputs. At that point it spin-waits on a global memory variable. A and B write their results to global memory, and when each is done it atomically increments the aforementioned global variable. When C sees that both have written their results, it can use them and finish up. Ideally this takes 4 seconds plus a bit of overhead, as opposed to 2 s + 4 s plus an extra kernel launch if I use a 2-kernel approach.

However, is there any way to guarantee that all of A and B's outputs have reached memory BEFORE the atomic variable update? Can I assume all global writes happen in program order WITHIN a block?

Sounds like it might be a job for __threadfence(). There is an example of it in the SDK, in the context of a reduction, with a similar usage pattern: __threadfence() ensures that a thread's prior global-memory writes are visible to other blocks before any subsequent writes (such as the atomic signal).
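To make the pattern concrete, here is a minimal sketch of the scheme described above, using __threadfence() before the atomic signal. The names (`results`, `done_count`) and the placeholder computation are illustrative assumptions, not from the original post; the key point is the fence between the result writes and the atomicAdd.

```cuda
#include <cstdio>

__device__ int   done_count = 0;  // how many producer blocks have finished
__device__ float results[2];      // outputs of blocks A and B

__global__ void kernel()
{
    if (blockIdx.x < 2) {                 // blocks A and B: producers
        // ... 2 seconds of work producing a result ...
        if (threadIdx.x == 0) {
            results[blockIdx.x] = 42.0f;  // placeholder result
            __threadfence();              // make the write visible device-wide
            atomicAdd(&done_count, 1);    // only THEN signal completion
        }
    } else {                              // block C: consumer
        // ... 2 seconds of independent work ...
        if (threadIdx.x == 0) {
            // Spin until both producers have signalled. atomicAdd(..., 0)
            // forces a fresh read of global memory on every iteration,
            // so the loop cannot be optimized into a stale register read.
            while (atomicAdd(&done_count, 0) < 2) { /* spin */ }
        }
        __syncthreads();                  // rest of block C waits on the spinner
        // ... use results[0] and results[1] to finish up ...
    }
}
```

Note the big caveat: this is only safe if all three blocks are resident on the GPU at the same time (which the one-block-per-MP occupancy assumption is meant to guarantee). If C is scheduled before A or B has been launched onto an MP, the spin loop deadlocks.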