Enforcing host-to-device write visibility order

The context: I have a long-running CUDA kernel that:

  1. spins on an int32 buffer A until it becomes 1
  2. consumes the data from buffer B
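
To make the kernel side concrete, here is a minimal sketch of those two steps. The names `wait_and_consume`, `flag` (for A), `data` (for B), `n`, and `out` are placeholders of mine. Using libcu++'s `cuda::atomic_ref` with a system-scope acquire load is my assumption about what the spin could look like (it needs a toolkit recent enough to ship `cuda::atomic_ref`); I am not certain it is sufficient on its own.

```cuda
#include <cuda/atomic>  // libcu++: cuda::atomic_ref, cuda::thread_scope_system

// Hypothetical kernel. `flag` plays the role of A and `data` the role of B.
// Both are assumed to point at memory the host can write while the kernel
// is running (for example pinned, mapped host memory).
__global__ void wait_and_consume(int* flag, const int* data, int n, int* out)
{
    // A single thread does the work to keep the sketch simple.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        cuda::atomic_ref<int, cuda::thread_scope_system> ready(*flag);

        // Step 1: spin until A becomes 1. The acquire load is meant to order
        // the reads of B below after the observation of the signal.
        while (ready.load(cuda::memory_order_acquire) != 1) { }

        // Step 2: consume B (here: just sum it up).
        int sum = 0;
        for (int i = 0; i < n; ++i) {
            sum += data[i];
        }
        *out = sum;
    }
}
```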

A therefore serves as a signal to the kernel that the data in B is ready. In this scenario the host code does the following, and the writes to B must be visible to the kernel before the write to A:

  1. write data to B
  2. write 1 to A
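
For illustration, this is roughly how I picture the host side. The helper name `publish` and its parameters are hypothetical, and I am assuming that A and B live in memory the host can store to directly (for example allocated with `cudaHostAlloc` and the `cudaHostAllocMapped` flag). The release store through `cuda::atomic_ref` is my guess at how to express the ordering; whether it is actually enough is part of what I am asking.

```cuda
#include <cuda/atomic>  // libcu++: cuda::atomic_ref is usable from host code too

// Hypothetical host helper. `data` is B, `flag` is A, and `src` is the
// payload to hand over to the kernel.
void publish(int* data, int* flag, const int* src, int n)
{
    // Step 1: write data to B with plain stores.
    for (int i = 0; i < n; ++i) {
        data[i] = src[i];
    }

    // Step 2: write 1 to A. The release ordering is intended to keep the
    // stores to B above from becoming visible after this store.
    cuda::atomic_ref<int, cuda::thread_scope_system> ready(*flag);
    ready.store(1, cuda::memory_order_release);
}
```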

What is the best way to achieve this? Through what mechanism should these writes take place (unified memory? GDRCopy?)? Is there any CUDA API I can call to flush the writes to B before signalling via A?