Branching streams: cuStreamCreate vs. reordering

I’m computing along in stream-0, and at some point I want to copy some data off the GPU to send to a different node in my cluster, ideally while the computation in stream-0 continues undisturbed.

Here’s what I think I need to do:

  1. Compute data “d”.
  2. Create stream “s”.
  3. Prepare a packet “p” out of “d” within “s”.
  4. Initiate copy of “p” into page-locked “p_cpu” within “s”.
  5. Upon completion of “s”, p_cpu is ready to use.

My question relates to the transition from step 1) to step 2): am I guaranteed that the computation of “d” has finished before step 2), and therefore anything I later submit to “s”, starts?

In other words, is stream creation a barrier to reordering? If not, what’s the proper way to achieve this “stream-branching” behavior?
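
For concreteness, here is a minimal sketch of the pattern I suspect is intended: record an event in the compute stream right after “d” is produced, and make “s” wait on that event before touching “d”. It uses the runtime API for brevity; as far as I know the driver-API counterparts are cuEventRecord and cuStreamWaitEvent. The kernels compute_d and pack_p, the event name d_ready, and the buffer size are hypothetical stand-ins, and I'm assuming the compute stream is an explicitly created non-blocking stream rather than the legacy default stream, which has its own implicit synchronization rules.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in for step 1: compute "d".
__global__ void compute_d(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = 2.0f * i;
}

// Hypothetical stand-in for step 3: build packet "p" out of "d".
__global__ void pack_p(const float *d, float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = d[i] + 1.0f;
}

int main() {
    const int n = 1 << 20;                 // assumed size, for illustration
    const size_t bytes = n * sizeof(float);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d = nullptr, *p = nullptr, *p_cpu = nullptr;
    CHECK(cudaMalloc(&d, bytes));
    CHECK(cudaMalloc(&p, bytes));
    CHECK(cudaMallocHost(&p_cpu, bytes));  // page-locked host buffer

    cudaStream_t compute, s;
    cudaEvent_t d_ready;
    CHECK(cudaStreamCreateWithFlags(&compute, cudaStreamNonBlocking));
    CHECK(cudaEventCreateWithFlags(&d_ready, cudaEventDisableTiming));

    // Step 1: compute "d", then mark the point at which it is complete.
    compute_d<<<blocks, threads, 0, compute>>>(d, n);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(d_ready, compute));

    // Step 2: create "s" and make it wait on the event. The dependency
    // comes from the event, not from the act of creating the stream.
    CHECK(cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking));
    CHECK(cudaStreamWaitEvent(s, d_ready, 0));

    // Steps 3 and 4: pack and copy, both ordered within "s".
    pack_p<<<blocks, threads, 0, s>>>(d, p, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(p_cpu, p, bytes, cudaMemcpyDeviceToHost, s));

    // More work may be queued in "compute" here, provided it does not
    // overwrite "d" before pack_p has read it.

    // Step 5: once "s" has drained, p_cpu is ready to use on the host.
    CHECK(cudaStreamSynchronize(s));
    printf("p_cpu[3] = %f\n", p_cpu[3]);

    CHECK(cudaEventDestroy(d_ready));
    CHECK(cudaStreamDestroy(s));
    CHECK(cudaStreamDestroy(compute));
    CHECK(cudaFreeHost(p_cpu));
    CHECK(cudaFree(p));
    CHECK(cudaFree(d));
    return 0;
}
```

In this sketch only “s” waits on the event; the host thread and the compute stream are not blocked by cudaStreamWaitEvent, which is what would let the computation carry on while the packet is prepared and copied.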