Memory synchronization when using dynamic parallelism


I have a question regarding memory synchronization while using dynamic parallelism:
If a parent kernel thread launches two child kernels, both on the same newly created stream and both reading from and writing to the same addresses in global memory, what guarantees are made?

My understanding of the documentation is the following:

  1. Children which are started on the same stream will run in serial, but asynchronously to the parent.
  2. Child 2 will see all modifications to global memory made by Child 1 (as they are on the same stream) without any additional synchronization calls.
  3. The children will see all modifications made by the parent to global memory up to the call to launch the children.
  4. The parent will see all modifications by the children after a cudaDeviceSynchronize call.
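
To make the scenario concrete, here is a minimal sketch of the setup I have in mind. The kernel names, the buffer size N and the values are made up for illustration. The add-then-multiply pair is chosen so that the result depends on the order in which the children run. The final check is done on the host after the parent grid has finished, so it only exercises points 1 to 3; the spot where point 4 would matter is marked with a comment. I assume it is compiled with relocatable device code and linked against the device runtime (something like nvcc -rdc=true -lcudadevrt) for a GPU that supports dynamic parallelism.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 32;  // hypothetical buffer size, one child thread per element

// Child 1: adds 1 to every element.
__global__ void childAdd(int *data) { data[threadIdx.x] += 1; }

// Child 2: doubles every element; the result depends on Child 1 having run first.
__global__ void childMul(int *data) { data[threadIdx.x] *= 2; }

// Parent: launched with a single thread to keep the sketch simple.
__global__ void parent(int *data) {
    // Parent writes made before the launches (point 3).
    for (int i = 0; i < N; ++i) data[i] = 10;

    // Streams created on the device must use the non-blocking flag.
    cudaStream_t s;
    if (cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking) != cudaSuccess) return;

    // Both children go into the same stream (points 1 and 2).
    childAdd<<<1, N, 0, s>>>(data);
    childMul<<<1, N, 0, s>>>(data);

    // Point 4 would apply here: the parent thread has not synchronized with
    // its children, so I do not read data[] on the device in this sketch.

    cudaStreamDestroy(s);
}

// Returns the number of mismatching elements, or -1 on a CUDA error.
int run_demo() {
    int *d = nullptr;
    if (cudaMalloc(&d, N * sizeof(int)) != cudaSuccess) return -1;

    parent<<<1, 1>>>(d);

    // The parent grid only counts as finished once its children have finished,
    // so this host-side synchronize also waits for both children.
    int h[N];
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return -1;

    // If points 1 to 3 hold, every element is (10 + 1) * 2 = 22.
    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (h[i] != 22) ++mismatches;
    return mismatches;
}

int main() {
    int r = run_demo();
    printf("mismatches: %d\n", r);
    return r == 0 ? 0 : 1;
}
```

If the children ran in the opposite order, the elements would be 10 * 2 + 1 = 21 instead, which is why I picked these two operations.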

Is this correct?

For the parent to see the memory modifications made by its children, does a cudaStreamWaitEvent that waits for them to finish suffice? Or is cudaDeviceSynchronize necessary, for example to flush caches?

Thank you in advance for helping me understand dynamic parallelism :)