I should retract or edit my previous statement; your reasoning is correct and well stated. For continuity of the thread I’ve elected not to modify my previous post, and to provide additional clarification in this post instead:
There are several phases in an event lifetime:
1. created but not yet recorded
2. recorded but not yet completed
3. completed
Phase 1 is entered upon the cudaEventCreate() call. I referred to this previously as an “undefined” state for the event; that is, it is not well defined what cudaEventQuery() will return for such an event. According to my testing, it returns cudaSuccess, i.e. it is as if the event is in phase 3.
Phase 2 is entered when the cudaEventRecord() call is encountered. In this phase, cudaEventQuery() on the event returns cudaErrorNotReady.
Phase 3 is entered when stream processing has reached the point at which the event was recorded. (Or, in our subsequent treatment of graphs, when graph processing has reached the node where the cudaEventRecord took place; this is awkward to state.) In this phase, cudaEventQuery() returns cudaSuccess.
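The three phases can be observed with a small stream-based test. This is only a sketch of the kind of test described above, not the exact code that was run: the spin kernel, the report() helper, and the cycle count are assumptions chosen so the event stays in phase 2 long enough to be queried.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical busy-wait kernel: keeps the stream occupied so the event
// remains "recorded but not yet completed" for a noticeable time.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Query the event, print the status name, and return the status.
static cudaError_t report(const char* label, cudaEvent_t ev) {
    cudaError_t st = cudaEventQuery(ev);
    std::printf("%-22s %s\n", label, cudaGetErrorName(st));
    return st;
}

int main() {
    cudaStream_t stream;
    cudaEvent_t ev;
    cudaStreamCreate(&stream);
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);

    // Phase 1: created, never recorded.
    report("phase 1 (created):", ev);

    // Phase 2: recorded behind a long-running kernel.
    // The cycle count is arbitrary (assumption); it only needs to outlast the query.
    spin<<<1, 1, 0, stream>>>(1000000000LL);
    cudaEventRecord(ev, stream);
    report("phase 2 (recorded):", ev);

    // Phase 3: the stream has passed the point where the event was recorded.
    cudaStreamSynchronize(stream);
    report("phase 3 (completed):", ev);

    cudaEventDestroy(ev);
    cudaStreamDestroy(stream);
    return 0;
}
```

Built with nvcc, this prints one status per phase, which makes it easy to compare phase 1 against phase 3 on a given driver and toolkit.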
My treatment of phase 1 is arguable. You might consider it to be defined, since the description given for cudaEventQuery() is:
Returns cudaSuccess if all captured work has been completed, or cudaErrorNotReady if any captured work is incomplete.
In phase 1, the event has not been recorded, therefore it has captured no work. If you take that viewpoint, then we can say the behavior is well defined, and cudaEventQuery() should return cudaSuccess. According to my testing, that is the observed behavior. Regardless, a hazard exists: the phase 1 state is indistinguishable (from the viewpoint of cudaEventQuery()) from the phase 3 state. Therefore, before we can reliably tell phase 3 from phase 2, we must be certain that the event is not in phase 1. So for signaling that makes sense to me, we must ensure that the event has been recorded before we start using cudaEventQuery to determine whether the “captured work is completed”. This looks to me like additional synchronization of some sort is needed, beyond what is expressly provided via cudaEvent usage.
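One way to picture that additional synchronization in the plain stream case is a host-side flag that is set only after cudaEventRecord() has returned. The sketch below is an illustration under assumptions, not a recommended pattern from the CUDA documentation: g_recorded, wait_for_event(), and the spin kernel are hypothetical names introduced here.

```cuda
#include <atomic>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Hypothetical busy-wait kernel standing in for real work.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Hypothetical host-side flag: true only once cudaEventRecord() has returned,
// i.e. once the event is known to have left phase 1.
static std::atomic<bool> g_recorded{false};

// Consumer side: first wait until the event is known to be recorded, then poll.
// Returns the number of cudaErrorNotReady polls, or -1 on an unexpected error.
static int wait_for_event(cudaEvent_t ev) {
    while (!g_recorded.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    int polls = 0;
    cudaError_t st;
    while ((st = cudaEventQuery(ev)) == cudaErrorNotReady) {
        ++polls;
        std::this_thread::yield();
    }
    return st == cudaSuccess ? polls : -1;
}

int main() {
    cudaStream_t stream;
    cudaEvent_t ev;
    cudaStreamCreate(&stream);
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);

    int polls = 0;
    std::thread consumer([&] { polls = wait_for_event(ev); });

    // Producer side: enqueue work, record the event, then publish the flag.
    spin<<<1, 1, 0, stream>>>(100000000LL);  // arbitrary duration (assumption)
    cudaEventRecord(ev, stream);
    g_recorded.store(true, std::memory_order_release);

    consumer.join();
    std::printf("event completed after %d not-ready polls\n", polls);

    cudaEventDestroy(ev);
    cudaStreamDestroy(stream);
    return 0;
}
```

Without the flag, the consumer could query the event while it is still in phase 1, see cudaSuccess, and wrongly conclude that the work had finished.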
When we switch to graph handling, in my view the above concepts do not change. However, we should ask “when exactly does the event get recorded?” for a graph. It seems to me that the statement:
Each launch of the graph will record event to capture execution of the node’s dependencies.
should be interpreted at face value. The graph launch is effectively what records the event, and the event so recorded “captures” the previous work, whatever that means for the dependencies expressed in the graph. The graph launch provides that “extra” synchronization I referred to earlier; it guarantees to move the event from phase 1 to phase 2.
Once we understand that, then I believe the description given in the blog article makes sense.
1. A cudaEvent is created outside of any graph activity
2. The first graph has a record node in it, that records the event from item 1. The launch of the first graph effectively records this event, moving its phase from 1 to 2. We thus avoid the ambiguity between phase 1 and phase 3.
3. The launch of the second graph does not inherently modify the state or phase of the event from item 1 in any way. That event is in phase 2, or phase 3 depending on the processing that has taken place in graph 1, and graph 2 can reliably observe that state/phase.
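The sequence above can be sketched with the explicit graph API. This is a minimal illustration under assumptions and not the blog article's code: the produce and consume kernels, the run_two_graphs() helper, the CHECK macro, and the use of an event wait node in the second graph are all choices made here for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call)                                                       \
    do {                                                                  \
        cudaError_t e_ = (call);                                          \
        if (e_ != cudaSuccess) {                                          \
            std::fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(e_)); \
            return 1;                                                     \
        }                                                                 \
    } while (0)

// Hypothetical producer: spins for a while, then publishes a value.
__global__ void produce(int* data, long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
    *data = 42;
}

// Hypothetical consumer: copies whatever the producer published.
__global__ void consume(const int* data, int* out) { *out = *data; }

// Builds and launches both graphs; returns 0 on success and stores the
// value seen by the consumer in *result.
static int run_two_graphs(int* result) {
    int *data, *out;
    CHECK(cudaMalloc(&data, sizeof(int)));
    CHECK(cudaMalloc(&out, sizeof(int)));
    CHECK(cudaMemset(data, 0, sizeof(int)));
    CHECK(cudaMemset(out, 0, sizeof(int)));

    // Item 1: the event is created outside of any graph activity.
    cudaEvent_t ev;
    CHECK(cudaEventCreateWithFlags(&ev, cudaEventDisableTiming));

    // Item 2: graph 1 = produce -> record(ev).
    cudaGraph_t g1;
    CHECK(cudaGraphCreate(&g1, 0));
    long long cycles = 100000000LL;  // arbitrary duration (assumption)
    void* pargs[] = { &data, &cycles };
    cudaKernelNodeParams kp = {};
    kp.func = (void*)produce;
    kp.gridDim = dim3(1);
    kp.blockDim = dim3(1);
    kp.kernelParams = pargs;
    cudaGraphNode_t prodNode, recNode;
    CHECK(cudaGraphAddKernelNode(&prodNode, g1, nullptr, 0, &kp));
    CHECK(cudaGraphAddEventRecordNode(&recNode, g1, &prodNode, 1, ev));

    // Item 3: graph 2 = wait(ev) -> consume. It has no record node for ev.
    cudaGraph_t g2;
    CHECK(cudaGraphCreate(&g2, 0));
    void* cargs[] = { &data, &out };
    cudaKernelNodeParams kc = {};
    kc.func = (void*)consume;
    kc.gridDim = dim3(1);
    kc.blockDim = dim3(1);
    kc.kernelParams = cargs;
    cudaGraphNode_t waitNode, consNode;
    CHECK(cudaGraphAddEventWaitNode(&waitNode, g2, nullptr, 0, ev));
    CHECK(cudaGraphAddKernelNode(&consNode, g2, &waitNode, 1, &kc));

    cudaGraphExec_t x1, x2;
    CHECK(cudaGraphInstantiateWithFlags(&x1, g1, 0));
    CHECK(cudaGraphInstantiateWithFlags(&x2, g2, 0));

    cudaStream_t s1, s2;
    CHECK(cudaStreamCreate(&s1));
    CHECK(cudaStreamCreate(&s2));

    // Host-side launch order matters: graph 1 must be launched first so that
    // the event has left phase 1 before graph 2 is launched.
    CHECK(cudaGraphLaunch(x1, s1));
    CHECK(cudaGraphLaunch(x2, s2));
    CHECK(cudaStreamSynchronize(s1));
    CHECK(cudaStreamSynchronize(s2));

    CHECK(cudaMemcpy(result, out, sizeof(int), cudaMemcpyDeviceToHost));

    cudaGraphExecDestroy(x1); cudaGraphExecDestroy(x2);
    cudaGraphDestroy(g1); cudaGraphDestroy(g2);
    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaEventDestroy(ev);
    cudaFree(data); cudaFree(out);
    return 0;
}

int main() {
    int value = 0;
    if (run_two_graphs(&value) != 0) return 1;
    std::printf("consumer observed %d\n", value);
    return 0;
}
```

If the two cudaGraphLaunch() calls were swapped, graph 2 could be launched while the event is still in phase 1, and its wait node would have nothing to wait on. cudaGraphInstantiateWithFlags() is used here because its signature is the same across recent toolkit versions; that is a convenience choice for the sketch.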
So after running through all that description, I apologize if I miscommunicated. If the graph has a record node in it, then we should assume that the launch of the graph will set that event state/phase to phase 2, which is the cudaErrorNotReady state, and that is what you said. But this does not mean that:
- A graph that doesn’t have a record node for that event would modify that event state in any way at its launch (such as the subsequent graph - which observes the event but presumably may not have a record node in it)
- The state (even a nanosecond after the launch) is guaranteed to be cudaErrorNotReady. The state is only guaranteed to be either cudaErrorNotReady (i.e. Phase 2) or cudaSuccess (i.e. Phase 3).
The state after launch of graph 1 could conceivably be phase 3/cudaSuccess, for example if the record node in graph 1 had no prior dependencies. The event would then transition to the completed/cudaSuccess state immediately upon being recorded.
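A small sketch of that corner case: a graph whose only node is a record node with no dependencies, queried right after launch. The EventStatus enum and classify() helper are hypothetical names introduced for illustration; the point is that calling code should accept either status immediately after the launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical classification of a cudaEventQuery() result.
enum class EventStatus { Pending, Complete, Error };

static EventStatus classify(cudaError_t st) {
    if (st == cudaSuccess) return EventStatus::Complete;       // phase 3 (or phase 1 if never recorded)
    if (st == cudaErrorNotReady) return EventStatus::Pending;  // phase 2
    return EventStatus::Error;
}

static const char* name(EventStatus s) {
    switch (s) {
        case EventStatus::Pending:  return "pending (phase 2)";
        case EventStatus::Complete: return "complete (phase 3)";
        default:                    return "error";
    }
}

int main() {
    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);

    // Graph with a single record node and no dependencies.
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);
    cudaGraphNode_t recNode;
    cudaGraphAddEventRecordNode(&recNode, graph, nullptr, 0, ev);

    cudaGraphExec_t exec;
    cudaGraphInstantiateWithFlags(&exec, graph, 0);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaGraphLaunch(exec, stream);

    // Right after launch, either status is a legitimate observation.
    std::printf("right after launch: %s\n", name(classify(cudaEventQuery(ev))));

    cudaStreamSynchronize(stream);
    std::printf("after synchronize:  %s\n", name(classify(cudaEventQuery(ev))));

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaEventDestroy(ev);
    return 0;
}
```

Which status the first query reports depends on timing, so no particular result should be relied upon there; only the query after the synchronize is determined.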
Again, you were correct in your statement. However, for clarity, I do not believe it is wise for me to communicate that a graph launch with a record node is guaranteed to put the event into a cudaErrorNotReady state. From an observational standpoint, it is guaranteed to put it into either a cudaErrorNotReady state or a cudaSuccess state, i.e. either into Phase 2 or Phase 3. We could also say the launch is guaranteed to put it into at least the cudaErrorNotReady state, which is again approximately what you said, so I apologize for my lack of clarity.