Branching Streams: cuStreamCreate vs reordering

inducer · May 7, 2009, 4:53pm

I’m happily computing along in stream-0, and then I want to copy some data off the GPU to send to a different node in my cluster, ideally while the computation in stream-0 continues.

Here’s what I think I need to do:

1) Compute data “d”.
2) Create stream “s”.
3) Prepare a packet “p” out of “d” within “s”.
4) Initiate copy of “p” into page-locked “p_cpu” within “s”.
5) Upon completion of “s”, p_cpu is ready to use.

My question now relates to step 1): Am I guaranteed that the computation on “d” is finished before step 2) starts?

In other words, is stream creation a barrier to reordering? If not, what’s the proper way to achieve this “stream-branching” behavior?
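
For concreteness, here is a rough sketch of the five steps, written against the runtime API (cudaStreamCreate rather than cuStreamCreate) to keep it short. The kernels `compute` and `pack`, the `CHECK` macro, and the buffer size are placeholders I made up for illustration. The event record/wait pair is only my guess at how the dependency could be stated explicitly, assuming a CUDA version that provides cudaStreamWaitEvent; I don't know whether it is actually required.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder for the main computation that produces "d".
__global__ void compute(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = 2.0f * i;
}

// Placeholder for building the packet "p" out of "d".
__global__ void pack(const float *d, float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = d[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d, *p, *p_cpu;
    CHECK(cudaMalloc((void **)&d, bytes));
    CHECK(cudaMalloc((void **)&p, bytes));
    CHECK(cudaMallocHost((void **)&p_cpu, bytes));  // page-locked host buffer

    // Step 1: compute "d" in stream 0.
    compute<<<blocks, threads>>>(d, n);
    CHECK(cudaGetLastError());

    // Mark the point in stream 0 at which "d" is complete.
    cudaEvent_t d_ready;
    CHECK(cudaEventCreateWithFlags(&d_ready, cudaEventDisableTiming));
    CHECK(cudaEventRecord(d_ready, 0));

    // Step 2: create stream "s".
    cudaStream_t s;
    CHECK(cudaStreamCreate(&s));

    // Assumption: make "s" wait for "d" explicitly instead of relying
    // on stream creation to order anything.
    CHECK(cudaStreamWaitEvent(s, d_ready, 0));

    // Step 3: prepare packet "p" out of "d" within "s".
    pack<<<blocks, threads, 0, s>>>(d, p, n);
    CHECK(cudaGetLastError());

    // Step 4: initiate the copy of "p" into "p_cpu" within "s".
    CHECK(cudaMemcpyAsync(p_cpu, p, bytes, cudaMemcpyDeviceToHost, s));

    // The calls above return without waiting for the GPU, so the host
    // is free to enqueue more work at this point.

    // Step 5: once "s" has completed, p_cpu is ready to use.
    CHECK(cudaStreamSynchronize(s));
    printf("p_cpu[3] = %f\n", p_cpu[3]);

    CHECK(cudaEventDestroy(d_ready));
    CHECK(cudaStreamDestroy(s));
    CHECK(cudaFreeHost(p_cpu));
    CHECK(cudaFree(p));
    CHECK(cudaFree(d));
    return 0;
}
```

If stream creation already acts as a barrier, the event pair in the middle would be redundant, and that is exactly the part I am unsure about.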

Thanks,
Andreas