Does cuStreamSynchronize() on a blocking stream wait for work in the NULL stream to complete? For example:
cuMemcpyDtoHAsync on NULL stream;
cuStreamSynchronize on a user created blocking stream;
When cuStreamSynchronize returns, is the async memcpy guaranteed to have completed on the GPU?
A blocking stream is the default type of stream created when doing
cudaStreamSynchronize() waits for all issued work to that stream to complete.
In the case where work is issued to that created stream both before and after work that is issued to the NULL stream, the NULL stream work will be completed prior to the subsequent created-stream work, and therefore prior to the completion of a cudaStreamSynchronize() that is issued to the created stream after all that.
In the case where you have a sequence (possibly empty) of work issued to a created stream, followed by work issued to the NULL stream, followed by a cudaStreamSynchronize() call issued to the created stream, I know of nothing in CUDA stream semantics that guarantees that the cudaStreamSynchronize() call will wait for work issued to the NULL stream.
Note that a cuMemcpyDtoHAsync() call issued to the NULL stream may become a blocking call all by itself. In that circumstance, the question becomes moot.
I imagine you are asking this to understand behavior. So be it.
From a programming perspective, I would strongly encourage CUDA programmers not to write code that depends on these mental gymnastics to properly understand or predict the behavior. I view that as madness. When teaching CUDA concurrency topics, I usually advise people to leave the NULL stream behind. Any sort of concurrency you desire to orchestrate can be done entirely using created streams, and this approach (in my opinion) makes it easier to sort out and predict expected behavior.
Obviously (as it is my opinion), others may have other opinions.
Let’s review the most fundamental CUDA stream semantics:
Work issued to a particular stream is executed in issue order. There can be no overlap of items issued to a particular stream.
Work issued to separate created streams has no ordering specified by CUDA. Work issued to separate created streams may have the potential to overlap.
The default NULL stream has the following behavior. Work issued to the NULL stream will wait for any previously issued work to that device to complete before the NULL stream work begins. Work issued to that device after the work issued to the NULL stream will not begin until the NULL stream work has completed.
Thanks for your great answer, Rob! I do want to understand the behavior.
If cuStreamSynchronize can be understood as an operation, then the operations issued to the NULL stream before it should be completed before the cuStreamSynchronize on the created stream. That was the source of my confusion. :)