When I create a CUDA graph via stream capture in stream 1, and the capture also involves stream 2, I can end up with a graph like this, where A-B-D happen in stream 1 and C happens in stream 2. Then, when I replay the graph in stream 3 (a new stream), what happens to operation C? Does it still remember stream 2, or does it create an internal stream?
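For reference, the scenario above corresponds to the standard fork/join stream-capture pattern. This is a hedged sketch (kernel names A-D are placeholders for the picture, and error checking is omitted); the key point is that stream 1 and stream 2 only exist during capture, and replay takes whatever stream is passed to `cudaGraphLaunch`:

```cuda
#include <cuda_runtime.h>

// Placeholder kernels standing in for the A-B-D / C work in the picture.
__global__ void A() {} __global__ void B() {}
__global__ void C() {} __global__ void D() {}

void buildGraph(cudaGraphExec_t* graphExec) {
    cudaStream_t s1, s2;
    cudaEvent_t fork, join;
    cudaStreamCreate(&s1); cudaStreamCreate(&s2);
    cudaEventCreate(&fork); cudaEventCreate(&join);

    cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal);
    A<<<1, 1, 0, s1>>>();
    cudaEventRecord(fork, s1);           // fork: s2's work depends on A
    cudaStreamWaitEvent(s2, fork, 0);
    B<<<1, 1, 0, s1>>>();                // B and C have no mutual dependency
    C<<<1, 1, 0, s2>>>();
    cudaEventRecord(join, s2);           // join: D depends on both B and C
    cudaStreamWaitEvent(s1, join, 0);
    D<<<1, 1, 0, s1>>>();

    cudaGraph_t graph;
    cudaStreamEndCapture(s1, &graph);
    cudaGraphInstantiate(graphExec, graph, 0);  // CUDA 12.x signature
    cudaGraphDestroy(graph);
}

// Replay in a brand-new stream s3; s1 and s2 play no role at launch time:
//   cudaGraphLaunch(*graphExec, s3);
```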
If you capture a graph from stream 1 and stream 2 and then execute it, it will never actually run in stream 1 and stream 2. The streams are only used to extract the dependencies between the nodes.
When I replay the graph in stream 3 (a new stream), does the graph internally create two streams to execute it?
CUDA will correctly handle all dependencies within the graph. How this is achieved is an undocumented implementation detail. You could check with Nsight Systems, for example, whether other streams are used.
However, just like with ordinary multi-stream code, there is no guarantee that kernels B and C will be executed concurrently within the graph.
I understand that “there is no guarantee that kernels B and C will be executed concurrently”.
My concern is: if the CUDA graph does not create new streams, does it flatten the graph into a straight line (e.g., via a topological sort), so that kernels B and C will never execute concurrently?
No. From my experience, independent kernels are able to run concurrently in the graph.
Then I assume the CUDA graph uses an internal mechanism to launch concurrent kernels, rather than using streams.
I think it should be sufficient to say that concurrency is possible and provided for by graph capture, whether using the stream capture method or the API capture method, without getting into assumptions about internal mechanisms or stream usage at the point of graph execution. The CUDA runtime is readily able to create streams (and many other entities, such as host threads) for its own usage, if that were intended by the CUDA designers. As already mentioned, I agree with the view that the behavior at the point of graph launch in this respect should be thought of as an implementation detail. Even if we were to try to tease out those details, they could change in any event.
Using API capture, one need not worry about streams per se for the purpose of concurrency. The method of enabling concurrency there has to do with the declaration of dependencies, not any discussion or specification of streams. The picture in your original posting, at least, indicates API graph definition.
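To illustrate that point, here is a hedged sketch of the same A/B/C/D picture built with the explicit graph API (kernel names are placeholders, kernels take no arguments, and error checking is omitted). Concurrency between B and C is expressed purely by giving each the same single dependency on A, with no streams mentioned anywhere:

```cuda
#include <cuda_runtime.h>

// Placeholder no-argument kernels for the nodes in the picture.
__global__ void A() {} __global__ void B() {}
__global__ void C() {} __global__ void D() {}

void buildGraphExplicit(cudaGraphExec_t* graphExec) {
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    cudaKernelNodeParams p = {};
    p.gridDim = dim3(1); p.blockDim = dim3(1);
    p.sharedMemBytes = 0;
    p.kernelParams = nullptr;   // kernels take no arguments
    p.extra = nullptr;

    cudaGraphNode_t nA, nB, nC, nD;
    p.func = (void*)A;
    cudaGraphAddKernelNode(&nA, graph, nullptr, 0, &p);  // A: no dependencies
    p.func = (void*)B;
    cudaGraphAddKernelNode(&nB, graph, &nA, 1, &p);      // B depends on A
    p.func = (void*)C;
    cudaGraphAddKernelNode(&nC, graph, &nA, 1, &p);      // C depends on A; B and C are independent
    cudaGraphNode_t depsD[2] = {nB, nC};
    p.func = (void*)D;
    cudaGraphAddKernelNode(&nD, graph, depsD, 2, &p);    // D depends on both B and C

    cudaGraphInstantiate(graphExec, graph, 0);  // CUDA 12.x signature
    cudaGraphDestroy(graph);
}
```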
For stream capture, the situation is more complicated (in my opinion). It is possible to have cross-stream activity, but only a particular stream is designated as the capture origin.
Thanks for the explanation! As long as “concurrency is possible and provided for by graph capture”, then it is good.
My original concern is that when I execute the graph, I only need to provide one stream, and I was afraid that CUDA flattens the graph into a straight line (e.g., via a topological sort) so that the kernels fit in one stream. That would be bad.
Without getting into implementation details, concurrency is possible.
It’s trivial to demonstrate using API capture, and I have done that in the past. I haven’t tried to do it with stream capture, but my read of the programming guide suggests it is possible.