Execution order between CUDA stream 0 and other streams

star.zxx · July 9, 2020, 6:51am

Assume we have a couple of device-to-device memcpys issued in the following sequence on the CPU:
cuMemcpy2D (srcA to dstA, device to device, default stream) → cuMemcpy2D (dstA to dstB, device to device, stream 1)
We know these two copies cannot overlap (stream 1 is created without the non-blocking flag), but is the ordering on the device guaranteed?
Is the first memcpy on the default stream guaranteed to run and finish before the second memcpy starts?