CUDA-Graph inter-graph events

Hello, I’d like clarification on how event record and wait nodes work in CUDA Graphs, in particular as inter-graph dependencies between two graph objects. I’m trying to understand how I can write code that replicates the behavior of Figure 5a in https://developer.nvidia.com/blog/a-guide-to-cuda-graphs-in-gromacs-2023/ , where two graphs execute concurrently with back-and-forth synchronization between nodes inside the graphs.

To use a small concrete example for discussion: suppose I create and instantiate two graphs, then launch Graph 1 followed by Graph 2 on separate streams.

Graph 1: (a) memset → (b) Record E1 → (c) Wait E2 → (d) memset
Graph 2: (a) Wait E1 → (b) Kernel → (c) Record E2.

I’d like the overall execution order of the nodes to be:

1a → 1b → 2a → 2b → 2c → 1c → 1d.
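For concreteness, here is roughly how I am constructing and launching these with the explicit graph APIs. This is just a sketch: the buffer, memset parameters, and kernel are placeholders, error checking is omitted, and I’m using the CUDA 12.x cudaGraphInstantiate signature.

```cpp
#include <cuda_runtime.h>

__global__ void kernel2b() { }  // placeholder for Graph 2's kernel (2b)

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Events intended as the inter-graph dependencies E1 and E2
    cudaEvent_t e1, e2;
    cudaEventCreate(&e1);
    cudaEventCreate(&e2);

    void *d_buf;                       // placeholder buffer for the memsets
    cudaMalloc(&d_buf, 1024);
    cudaMemsetParams mp = {};
    mp.dst = d_buf; mp.value = 0; mp.elementSize = 1;
    mp.width = 1024; mp.height = 1; mp.pitch = 1024;

    // Graph 1: (a) memset -> (b) record E1 -> (c) wait E2 -> (d) memset
    cudaGraph_t g1;
    cudaGraphCreate(&g1, 0);
    cudaGraphNode_t n1a, n1b, n1c, n1d;
    cudaGraphAddMemsetNode(&n1a, g1, nullptr, 0, &mp);
    cudaGraphAddEventRecordNode(&n1b, g1, &n1a, 1, e1);
    cudaGraphAddEventWaitNode(&n1c, g1, &n1b, 1, e2);
    cudaGraphAddMemsetNode(&n1d, g1, &n1c, 1, &mp);

    // Graph 2: (a) wait E1 -> (b) kernel -> (c) record E2
    cudaGraph_t g2;
    cudaGraphCreate(&g2, 0);
    cudaKernelNodeParams kp = {};
    kp.func = (void *)kernel2b;
    kp.gridDim = dim3(1); kp.blockDim = dim3(1);
    cudaGraphNode_t n2a, n2b, n2c;
    cudaGraphAddEventWaitNode(&n2a, g2, nullptr, 0, e1);
    cudaGraphAddKernelNode(&n2b, g2, &n2a, 1, &kp);
    cudaGraphAddEventRecordNode(&n2c, g2, &n2b, 1, e2);

    cudaGraphExec_t x1, x2;
    cudaGraphInstantiate(&x1, g1, 0);  // CUDA 12.x signature
    cudaGraphInstantiate(&x2, g2, 0);

    cudaGraphLaunch(x1, s1);           // Graph 1 first...
    cudaGraphLaunch(x2, s2);           // ...then Graph 2, on a separate stream
    cudaDeviceSynchronize();
    return 0;
}
```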

However, based on this wording in the CUDA Runtime API :: CUDA Toolkit Documentation, it’s not clear to me how this can be achieved:

cudaEventRecord() can be called multiple times on the
same event and will overwrite the previously captured state. Other APIs such as cudaStreamWaitEvent() use the most recently captured state at the time of the API call, and are not affected by later calls to cudaEventRecord(). Before the first call to cudaEventRecord(), an event represents an empty set of work, so for example cudaEventQuery() would return cudaSuccess.

If Graph 2 starts execution before Graph 1, then 2a could execute immediately, before 1b has run. Whereas if Graph 1 executes first, a valid order of execution would have 1c run before 2c, completing immediately, as the event E2 has not yet captured any dependencies.

Are there some defined semantics for when an event is reset to a non-ready state after a graph is launched but before the record node is executed? Otherwise I’m not sure how to achieve my desired behavior.

I don’t see a Figure 5a in that blog. I do see a Figure 5, which has 3 sections to it.

I don’t think that is being depicted. Figure 4 shows two GPUs interacting via a single graph, or via a single graph per timestep (Figure 5, center and right sections). Note the statements in the text as well:

For multi-GPU, use a single graph across all GPUs. So far, this is only supported with thread-MPI, where the multi-GPU graph is defined by exploiting the natural ability of CUDA to fork and join streams across different GPUs within the same process (using event-based GPU-side synchronization) and to automatically capture such workflows into a single graph.

(emphasis added)

The execution order you have suggested:

1a → 1b → 2a → 2b → 2c → 1c → 1d.

is most readily expressed as a single graph with suitable dependencies expressed in the graph (or as a single stream, in the non-graph case).
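As a sketch of what I mean (placeholder buffer and kernel, error checking omitted), the whole sequence can be captured from a single stream into one graph, with the stream ordering supplying all of the dependencies so that no event nodes are needed:

```cpp
#include <cuda_runtime.h>

__global__ void myKernel() { }          // placeholder for 2b

int main() {
    cudaStream_t s;
    cudaStreamCreate(&s);
    void *d_buf;                        // placeholder buffer
    cudaMalloc(&d_buf, 1024);

    // Capture 1a -> 2b -> 1d in order from one stream; the record/wait
    // steps (1b, 2a, 2c, 1c) become unnecessary.
    cudaGraph_t g;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    cudaMemsetAsync(d_buf, 0, 1024, s); // 1a
    myKernel<<<1, 1, 0, s>>>();         // 2b
    cudaMemsetAsync(d_buf, 0, 1024, s); // 1d
    cudaStreamEndCapture(s, &g);

    cudaGraphExec_t x;
    cudaGraphInstantiate(&x, g, 0);     // CUDA 12.x signature
    cudaGraphLaunch(x, s);
    cudaDeviceSynchronize();
    return 0;
}
```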

There are “one-way” inter-graph dependencies expressed in the blog, for example in the description of the odd/even timestep handling. As indicated in the article:

We overcame this problem by using a separate graph on odd and even steps (right), where these are linked using “external” CUDA events which can be recorded within one graph and enqueued within another (depicted by grey arrows), effectively overlapping the extra synchronization.

but this represents a dependency of the second timestep (second graph) on the first timestep (first graph). There is no dependency expressed of the first timestep (first graph) on the second timestep (second graph).

Thanks for taking the time to reply, @Robert_Crovella.

Apologies, yes, I did mean the graphic in Figure 5, and how to achieve the behavior in its rightmost section. Thanks for clarifying my understanding of the blog article: it uses a single multi-device graph, with external event synchronization between the odd/even timesteps.

Would you say that bi-directional sync between concurrently executing graphs is not a use case supported by event nodes? Or is there another way to use the CUDA Graphs API that would achieve this?

Even for one-way inter-graph dependencies, could you clarify the execution model when the odd/even executable graphs are launched on separate streams? Reducing my example to one-way sync between graph nodes:

Timestep 1 Graph: (a) kernel → (b) record E1 → (c) memset
Timestep 2 Graph: (a) wait E1 → (b) kernel.

If you launch Graph 1 followed by Graph 2 on separate streams with no other synchronization, are there any guarantees in the execution model that 2(b) won’t execute before 1(b)? Especially if the 1(a) kernel is long-running: wouldn’t the still-empty E1 event complete immediately when the timestep 2 graph runs concurrently?

If other synchronization is required to guarantee the desired one-way sync, what would that be? Submitting the odd/even graphs to the same stream seems like it would serialize their execution in order and lose the concurrency performance benefits.

That is mentioned in the article. In fact I already excerpted it.

The CUDA event is declared outside of the first graph, but the graphs can use it (to record it, and to wait on its completion). It is declared outside the first graph so that it is not “owned” by the graph and can be used from one graph to the next, as a signalling mechanism. If you’d like to know precisely how it is implemented, I won’t be able to give you a recipe right now; time permitting, I may be able to take another look in the future. In the meantime, this article is referring to a refactoring of GROMACS, which AFAIK is “open source,” or at least source-available for inspection. The article makes specific reference to this possibility:

For full technical details, see the GitLab Issue, Implement CUDA Graph Functionality and Perform Associated Refactoring, and the merge requests linked therein.
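That said, the “external event” mechanism itself can be sketched. During stream capture, an event record or wait carrying the External flag becomes an event node in the captured graph. This is only a sketch of the mechanism under that assumption, not necessarily what GROMACS actually does; the kernels and streams are placeholders:

```cpp
#include <cuda_runtime.h>

__global__ void oddKernel() { }   // placeholder for the odd-step work
__global__ void evenKernel() { }  // placeholder for the even-step work

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // The event is created outside any graph, so no graph "owns" it.
    cudaEvent_t extEvent;
    cudaEventCreate(&extEvent);

    // Odd-step graph: its work, then a record node for extEvent.
    cudaGraph_t gOdd;
    cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal);
    oddKernel<<<1, 1, 0, s1>>>();
    cudaEventRecordWithFlags(extEvent, s1, cudaEventRecordExternal); // record node
    cudaStreamEndCapture(s1, &gOdd);

    // Even-step graph: a wait node on extEvent, then its work.
    cudaGraph_t gEven;
    cudaStreamBeginCapture(s2, cudaStreamCaptureModeGlobal);
    cudaStreamWaitEvent(s2, extEvent, cudaEventWaitExternal);        // wait node
    evenKernel<<<1, 1, 0, s2>>>();
    cudaStreamEndCapture(s2, &gEven);
    return 0;
}
```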

I don’t know how to do it. I don’t think it is possible using only events. The issue with events is that the status of an event (recorded but not completed, vs. recorded and completed) is undefined when an event is created and not yet recorded. In my view this makes 2-way sync with only events and no other sync mechanisms quite difficult. Perhaps someone else has a clever idea.

CUDA graphs offer conditional nodes as well as the ability to use external semaphores. A combination of these could probably achieve the graph interlocking. However, I remain convinced that this is likely to lead to resource-intensive, inefficient use of the GPU, and that the operation dependencies would be better expressed via a single graph.

For example, you asked for:

1a → 1b → 2a → 2b → 2c → 1c → 1d.

That could be accomplished in a straightforward fashion by defining graph 1 and graph 2, and launching graph 2 as a child graph from the appropriate point in graph 1.
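As a sketch (placeholder buffer, memset parameters, and kernel; error checking omitted): once the dependencies are explicit in the parent graph, the event record/wait nodes are no longer needed, so the embedded graph holds just the 2b kernel.

```cpp
#include <cuda_runtime.h>

__global__ void kernel2b() { }  // placeholder for 2b

int main() {
    cudaStream_t s;
    cudaStreamCreate(&s);
    void *d_buf;                // placeholder buffer
    cudaMalloc(&d_buf, 1024);
    cudaMemsetParams mp = {};
    mp.dst = d_buf; mp.value = 0; mp.elementSize = 1;
    mp.width = 1024; mp.height = 1; mp.pitch = 1024;

    // Graph holding just graph 2's work (no event nodes needed).
    cudaGraph_t g2Work;
    cudaGraphCreate(&g2Work, 0);
    cudaKernelNodeParams kp = {};
    kp.func = (void *)kernel2b;
    kp.gridDim = dim3(1); kp.blockDim = dim3(1);
    cudaGraphNode_t n2b;
    cudaGraphAddKernelNode(&n2b, g2Work, nullptr, 0, &kp);

    // Parent graph: 1a -> (child graph: 2b) -> 1d
    cudaGraph_t parent;
    cudaGraphCreate(&parent, 0);
    cudaGraphNode_t n1a, nChild, n1d;
    cudaGraphAddMemsetNode(&n1a, parent, nullptr, 0, &mp);
    cudaGraphAddChildGraphNode(&nChild, parent, &n1a, 1, g2Work);
    cudaGraphAddMemsetNode(&n1d, parent, &nChild, 1, &mp);

    cudaGraphExec_t x;
    cudaGraphInstantiate(&x, parent, 0);  // CUDA 12.x signature
    cudaGraphLaunch(x, s);
    cudaDeviceSynchronize();
    return 0;
}
```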

Thanks for the replies, Robert. Conditional nodes and external semaphores are interesting ideas; I need to read up on them.

This thread was primarily intended to help me better understand event semantics in graphs, and I think I now understand that when cudaGraphLaunch is called, it synchronously resets any events associated with record nodes to the cudaErrorNotReady state before kicking off the async work. Is that a correct understanding of the execution model?

I didn’t intend to communicate that and don’t think I stated that anywhere. I’m fairly certain that’s not the case, and I think if that were the case, then the previously excerpted methodology from the blog could not possibly work. The event state must be able to be communicated from one graph to another.

Although you didn’t communicate this, I arrived at this conclusion because it’s the only way I could see the one-directional use case I described earlier (reduced from GROMACS) having guaranteed correct semantics:

Timestep 1 Graph: (a) kernel → (b) record E1 → (c) memset
Timestep 2 Graph: (a) wait E1 → (b) kernel.

Before anything is launched, E1 is in the empty-and-complete state. Suppose a user did cudaGraphLaunch(G1, S1); cudaGraphLaunch(G2, S2);. If the cudaGraphLaunch of G1 synchronously reset the event E1, then the desired execution order would occur regardless of whether G1 executes eagerly with respect to the G2 launch, and 2(a) could no longer finish too early, before 1(b).

Given that my assumption is not correct, could you describe, or point me to the documentation specifying, when an event changes state to cudaErrorNotReady when used in a graph record node?

I’m reading “Each launch of the graph will record event to capture execution of the node’s dependencies.” in the description of cudaGraphAddEventRecordNode, which implied to me that when you launch the graph, none of the nodes have executed yet, and so the event must immediately be set to the not-ready state.

I should retract or edit my previous statement; your reasoning is correct and pretty well stated. I’d like to provide additional clarification, and for continuity of the thread I’ve elected not to modify my previous post, but instead to provide this one:

There are several phases in an event lifetime:

  1. created but not yet recorded
  2. recorded but not yet completed
  3. completed

Phase 1 is entered upon the cudaEventCreate() call. I referred to this previously as an “undefined” state for the event, that is, it is not well defined what the result of doing cudaEventQuery() on such an event will produce. According to my testing, it will produce cudaSuccess, i.e. it is as if the event is in phase 3.

Phase 2 is entered when the cudaEventRecord() call is encountered. In this phase, the result of doing cudaEventQuery() on the event will produce cudaErrorNotReady.

Phase 3 is entered when the stream processing has reached the point at which the event was recorded (or, in the graph case, when graph processing has reached the node where the event record takes place). In this phase, the result of doing cudaEventQuery() is cudaSuccess.

My treatment of phase 1 is arguable. You might consider it to be defined, since the description given for cudaEventQuery() is:

Returns cudaSuccess if all captured work has been completed, or cudaErrorNotReady if any captured work is incomplete.

In phase 1, the event has not been recorded; therefore it has captured no work. If you take that viewpoint, then we can say the behavior is well-defined, and cudaEventQuery() should return cudaSuccess. According to my testing, that is the observed behavior. Regardless, a hazard exists: the phase 1 state is indistinguishable (from the viewpoint of cudaEventQuery()) from the phase 3 state. Therefore, before we can reliably distinguish phase 3 from phase 2, we must be certain that the event is not in phase 1. So in order to accomplish signaling that makes sense to me, we must ensure that the event has been recorded before we start to attempt to use cudaEventQuery() to determine whether the “captured work is completed”. This looks to me like additional synchronization of some sort is needed, beyond what is expressly provided via cudaEvent usage.
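Here is a small sketch illustrating the three phases via cudaEventQuery(). The busy-wait kernel is just a placeholder to keep the event incomplete for a while, and the phase 1 and phase 2 observations reflect my testing rather than a documented guarantee:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busyWait(long long cycles) {  // placeholder long-running work
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    cudaEvent_t ev;
    cudaEventCreate(&ev);
    // Phase 1: created but not yet recorded. Observed: cudaSuccess,
    // indistinguishable from phase 3.
    printf("phase 1 query: %s\n", cudaGetErrorName(cudaEventQuery(ev)));

    cudaStream_t s;
    cudaStreamCreate(&s);
    busyWait<<<1, 1, 0, s>>>(1000000000LL);   // work ahead of the record point
    cudaEventRecord(ev, s);
    // Phase 2: recorded but not yet completed (the kernel is almost
    // certainly still running). Observed: cudaErrorNotReady.
    printf("phase 2 query: %s\n", cudaGetErrorName(cudaEventQuery(ev)));

    cudaStreamSynchronize(s);
    // Phase 3: stream processing has passed the record point: cudaSuccess.
    printf("phase 3 query: %s\n", cudaGetErrorName(cudaEventQuery(ev)));
    return 0;
}
```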

When we switch to graph handling, in my view the above concepts do not change. However, we should ask “when exactly does the event get recorded?” for a graph. It seems to me that the statement:

Each launch of the graph will record event to capture execution of the node’s dependencies.

should be interpreted at face value. The graph launch is effectively what records the event, and the event so recorded “captures” the previous work, whatever that means for the dependencies expressed in the graph. The graph launch provides that “extra” synchronization I referred to earlier; it guarantees to move the event from phase 1 to phase 2.

Once we understand that, then I believe the description given in the blog article makes sense.

  1. A cudaEvent is created outside of any graph activity.
  2. The first graph has a record node in it that records the event from item 1. The launch of the first graph effectively records this event, moving its phase from 1 to 2. We thus avoid the ambiguity between phase 1 and phase 3.
  3. The launch of the second graph does not inherently modify the state or phase of the event from item 1 in any way. That event is in phase 2 or phase 3, depending on the processing that has taken place in graph 1, and graph 2 can reliably observe that state/phase (see the sketch after this list).
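To make that concrete, here is a sketch of the one-way example from earlier in the thread (placeholder kernels, error checking omitted, trailing memset dropped for brevity; the printout reflects the phase model above as observed in my testing, not a separately documented guarantee):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void longKernel() {     // placeholder for 1(a), kept long-running
    long long start = clock64();
    while (clock64() - start < 1000000000LL) { }
}
__global__ void step2Kernel() { }  // placeholder for 2(b)

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    cudaEvent_t e1;
    cudaEventCreate(&e1);          // created outside any graph: phase 1

    // Graph 1: (a) kernel -> (b) record E1
    cudaGraph_t g1;
    cudaGraphCreate(&g1, 0);
    cudaKernelNodeParams kp1 = {};
    kp1.func = (void *)longKernel;
    kp1.gridDim = dim3(1); kp1.blockDim = dim3(1);
    cudaGraphNode_t n1a, n1b;
    cudaGraphAddKernelNode(&n1a, g1, nullptr, 0, &kp1);
    cudaGraphAddEventRecordNode(&n1b, g1, &n1a, 1, e1);

    // Graph 2: (a) wait E1 -> (b) kernel; no record node for E1
    cudaGraph_t g2;
    cudaGraphCreate(&g2, 0);
    cudaKernelNodeParams kp2 = {};
    kp2.func = (void *)step2Kernel;
    kp2.gridDim = dim3(1); kp2.blockDim = dim3(1);
    cudaGraphNode_t n2a, n2b;
    cudaGraphAddEventWaitNode(&n2a, g2, nullptr, 0, e1);
    cudaGraphAddKernelNode(&n2b, g2, &n2a, 1, &kp2);

    cudaGraphExec_t x1, x2;
    cudaGraphInstantiate(&x1, g1, 0);  // CUDA 12.x signature
    cudaGraphInstantiate(&x2, g2, 0);

    cudaGraphLaunch(x1, s1);
    // Item 2: the launch of graph 1 records e1; since longKernel has not
    // finished, we expect to observe phase 2 (cudaErrorNotReady) here.
    printf("e1 after G1 launch: %s\n", cudaGetErrorName(cudaEventQuery(e1)));

    cudaGraphLaunch(x2, s2);
    // Item 3: launching graph 2 does not modify e1; its wait node observes it.
    cudaDeviceSynchronize();
    return 0;
}
```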

So after running through all that description, I apologize if I miscommunicated. If the graph has a record node in it, then we should assume that the launch of the graph will set that event’s state/phase to phase 2, i.e. the cudaErrorNotReady state, which is what you said. But this does not mean that:

  1. A graph that doesn’t have a record node for that event would modify that event’s state in any way at its launch (such as the subsequent graph, which observes the event but presumably does not have a record node in it).
  2. The state (even a nanosecond after the launch) is guaranteed to be cudaErrorNotReady. The state is guaranteed to be either cudaErrorNotReady (i.e. phase 2) or cudaSuccess (i.e. phase 3).

The state after the launch of graph 1 could conceivably be phase 3/cudaSuccess, for example, if the record node in graph 1 had no prior dependencies. The event would transition immediately upon being recorded to the completed/cudaSuccess state.

Again, you were correct in your statement. However, for clarity, I do not believe it is wise for me to communicate that a graph launch with a record node is guaranteed to put the event into a cudaErrorNotReady state. From an observational standpoint, it is guaranteed to put it into either a cudaErrorNotReady state, or a cudaSuccess state, i.e. either into Phase 2 or Phase 3. We could also say the launch is guaranteed to put it into at least the cudaErrorNotReady state, which is again approximately what you said, so I apologize for my lack of clarity.


Thanks Robert, that’s a very helpful analysis of the semantics. No worries about any miscommunication 🙂
