I’m happily computing along in stream-0, and then I want to copy some data off the GPU to send it to a different node in my cluster, ideally while the computation in stream-0 continues.
Here’s what I think I need to do:
1. Compute data “d”.
2. Create stream “s”.
3. Prepare a packet “p” out of “d” within “s”.
4. Initiate copy of “p” into page-locked “p_cpu” within “s”.
5. Upon completion of “s”, “p_cpu” is ready to use.
My question concerns step 1): am I guaranteed that the computation of “d” has finished before step 2) starts?
In other words, is stream creation a barrier to reordering? If not, what’s the proper way to achieve this “stream-branching” behavior?
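To make the question concrete, here is a minimal sketch of the branching I have in mind. It does not rely on stream creation acting as a barrier. Instead it assumes that an event recorded after the computation of “d”, together with `cudaStreamWaitEvent` on “s”, is an acceptable way to express the dependency. The kernels `compute_d` and `pack_p`, the `CHECK` macro, the function `branch_copy_demo`, and the named stream `compute` (standing in for stream-0) are all made up for illustration. I use non-blocking streams here rather than the legacy default stream, since the latter implicitly synchronizes with ordinary streams, which would get in the way of the overlap.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustration only: return false from the enclosing function on any CUDA error.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "%s\n", cudaGetErrorString(err_));        \
            return false;                                                  \
        }                                                                  \
    } while (0)

// Hypothetical stand-in for the main computation that produces "d".
__global__ void compute_d(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = 2.0f * i;
}

// Hypothetical stand-in for building the packet "p" out of "d".
__global__ void pack_p(const float *d, float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = d[i] + 1.0f;
}

// Runs steps 1-5 once and returns true if p_cpu holds the expected packet.
bool branch_copy_demo() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d = nullptr, *p = nullptr, *p_cpu = nullptr;
    CHECK(cudaMalloc((void **)&d, bytes));
    CHECK(cudaMalloc((void **)&p, bytes));
    CHECK(cudaMallocHost((void **)&p_cpu, bytes));  // page-locked host buffer

    cudaStream_t compute, s;
    cudaEvent_t d_ready;
    CHECK(cudaStreamCreateWithFlags(&compute, cudaStreamNonBlocking));
    CHECK(cudaEventCreateWithFlags(&d_ready, cudaEventDisableTiming));

    // Step 1: compute "d", then record an event marking the point where it is done.
    compute_d<<<blocks, threads, 0, compute>>>(d, n);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(d_ready, compute));

    // Step 2: create "s" and make everything queued in it wait for d_ready.
    CHECK(cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking));
    CHECK(cudaStreamWaitEvent(s, d_ready, 0));

    // Steps 3 and 4: build the packet and start the asynchronous copy, both in "s".
    pack_p<<<blocks, threads, 0, s>>>(d, p, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(p_cpu, p, bytes, cudaMemcpyDeviceToHost, s));

    // More work could be queued in `compute` here, as long as it does not
    // overwrite "d" before pack_p has read it.

    // Step 5: once "s" has drained, p_cpu is ready to use on the host.
    CHECK(cudaStreamSynchronize(s));

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (p_cpu[i] != 2.0f * i + 1.0f) { ok = false; break; }
    }

    CHECK(cudaEventDestroy(d_ready));
    CHECK(cudaStreamDestroy(s));
    CHECK(cudaStreamDestroy(compute));
    CHECK(cudaFreeHost(p_cpu));
    CHECK(cudaFree(p));
    CHECK(cudaFree(d));
    return ok;
}
```

Calling `branch_copy_demo()` from a `main` function and compiling with `nvcc` should be enough to try this out. Only the host thread waits in step 5; the `compute` stream itself is never blocked.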
Thanks,
Andreas