The context: I have a long-running CUDA kernel that:
- spins on an int32 buffer A until it becomes 1
- consumes the data from buffer B
A therefore serves as a signal to the kernel that the data in B is ready. In this scenario, the host code does the following, and the writes to B must be visible to the kernel before the write to A:
- write data to B
- write 1 to A
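
To make the scenario concrete, the handshake above could look roughly like the sketch below. None of the specifics are fixed by the question: it assumes A and B live in pinned, mapped host memory (`cudaHostAlloc` with `cudaHostAllocMapped`), that the kernel reads them through `volatile` pointers and issues `__threadfence_system()` between the flag read and the data reads, and that the host uses a C++ release fence between the two writes. The names `consumer`, `n` and `out` are hypothetical. This is one possible shape of the pattern, not a claim that it is the best mechanism.

```cuda
#include <atomic>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(e_));                     \
            std::exit(1);                                             \
        }                                                             \
    } while (0)

// Hypothetical consumer: one thread spins on A, then sums B into *out.
__global__ void consumer(volatile int* A, const volatile int* B, int n, int* out) {
    while (*A != 1) { }          // volatile forces a fresh read each iteration
    __threadfence_system();      // keep the reads of B after the read of A
    int sum = 0;
    for (int i = 0; i < n; ++i) sum += B[i];
    *out = sum;
}

int main() {
    const int n = 4;
    int *A, *B, *out;            // host pointers to pinned, mapped memory
    CHECK(cudaHostAlloc((void**)&A, sizeof(int), cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void**)&B, n * sizeof(int), cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void**)&out, sizeof(int), cudaHostAllocMapped));
    *A = 0;
    *out = 0;

    int *dA, *dB, *dOut;         // device-side aliases of the same buffers
    CHECK(cudaHostGetDevicePointer((void**)&dA, A, 0));
    CHECK(cudaHostGetDevicePointer((void**)&dB, B, 0));
    CHECK(cudaHostGetDevicePointer((void**)&dOut, out, 0));

    consumer<<<1, 1>>>(dA, dB, n, dOut);   // asynchronous launch
    CHECK(cudaGetLastError());

    // Step 1: write data to B.
    for (int i = 0; i < n; ++i) B[i] = i + 1;
    // Assumption: a release fence keeps the compiler and CPU from moving
    // the flag store ahead of the data stores.
    std::atomic_thread_fence(std::memory_order_release);
    // Step 2: write 1 to A.
    *(volatile int*)A = 1;

    CHECK(cudaDeviceSynchronize());
    std::printf("sum = %d\n", *out);       // 1+2+3+4 if B was visible in time

    CHECK(cudaFreeHost(A));
    CHECK(cudaFreeHost(B));
    CHECK(cudaFreeHost(out));
    return 0;
}
```

In this sketch the ordering comes from ordinary memory fences on both sides rather than from a dedicated CUDA "flush" call; whether that is sufficient for other placements of A and B (managed memory, device memory mapped through GDRCopy) is exactly what the question below is asking.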
What is the best way to achieve this? Through what mechanism should these writes take place (unified memory? GDRCopy?)? Is there any CUDA API I can call to flush the writes to B before signalling via A?