When I create a CUDA graph via stream capture in stream 1, and the capture also involves stream 2, I can end up with a graph like this, where A-B-D happen in stream 1 and C happens in stream 2. Then, when I replay the graph in stream 3 (a new stream), what happens to operation C? Does it still remember stream 2, or does it create an internal stream?
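For reference, the scenario above corresponds to the standard fork/join stream-capture pattern. This is a hedged sketch (kernel names A-D are placeholders for the picture, and error checking is omitted); the key point is that stream 1 and stream 2 only exist during capture, and replay takes whatever stream is passed to `cudaGraphLaunch`:

```cuda
#include <cuda_runtime.h>

// Placeholder kernels standing in for the A-B-D / C work in the picture.
__global__ void A() {} __global__ void B() {}
__global__ void C() {} __global__ void D() {}

void buildGraph(cudaGraphExec_t* graphExec) {
    cudaStream_t s1, s2;
    cudaEvent_t fork, join;
    cudaStreamCreate(&s1); cudaStreamCreate(&s2);
    cudaEventCreate(&fork); cudaEventCreate(&join);

    cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal);
    A<<<1, 1, 0, s1>>>();
    cudaEventRecord(fork, s1);           // fork: s2's work depends on A
    cudaStreamWaitEvent(s2, fork, 0);
    B<<<1, 1, 0, s1>>>();                // B and C have no mutual dependency
    C<<<1, 1, 0, s2>>>();
    cudaEventRecord(join, s2);           // join: D depends on both B and C
    cudaStreamWaitEvent(s1, join, 0);
    D<<<1, 1, 0, s1>>>();

    cudaGraph_t graph;
    cudaStreamEndCapture(s1, &graph);
    cudaGraphInstantiate(graphExec, graph, 0);  // CUDA 12.x signature
    cudaGraphDestroy(graph);
}

// Replay in a brand-new stream s3; s1 and s2 play no role at launch time:
//   cudaGraphLaunch(*graphExec, s3);
```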
If you capture a graph from stream 1 and stream 2 and then execute it, it will never actually run in stream 1 and stream 2. The streams are only used to extract the dependencies between the nodes.
When I replay the graph in stream 3 (a new stream), does the graph internally create two streams to execute it?
CUDA will correctly handle all dependencies within the graph. How this is achieved is an undocumented implementation detail. You could check with Nsight Systems, for example, whether other streams are used.
However, just like with ordinary multi-stream code, there is no guarantee that kernels B and C will be executed concurrently within the graph.
I understand that “there is no guarantee that kernels B and C will be executed concurrently”.
My concern is: if the CUDA graph does not create new streams, does it flatten the graph into a straight line (e.g., via a topological sort), so that kernels B and C will never execute concurrently?
No. From my experience, independent kernels are able to run concurrently in the graph.
Then I assume the CUDA graph uses an internal mechanism to launch concurrent kernels, rather than using streams.
I think it should be sufficient to say that concurrency is possible and provided for by graph capture, whether using the stream capture method or the API capture method, without getting into assumptions about internal mechanisms or stream usage at the point of graph execution. The CUDA runtime is readily able to create streams (and many other entities, such as host threads) for its own usage, if that were intended by the CUDA designers. As already mentioned, I agree with the view that the behavior at the point of graph launch in this respect should be thought of as an implementation detail. Even if we were to try to tease out those details, they could change in any event.
Using API capture, one need not worry about streams per se for the purpose of concurrency. The method of enabling concurrency there has to do with the declaration of dependencies, not any discussion or specification of streams. The picture in your original posting, at least, indicates API graph definition.
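To illustrate that point, here is a hedged sketch of the same A/B/C/D picture built with the explicit graph API (kernel names are placeholders, kernels take no arguments, and error checking is omitted). Concurrency between B and C is expressed purely by giving each the same single dependency on A, with no streams mentioned anywhere:

```cuda
#include <cuda_runtime.h>

// Placeholder no-argument kernels for the nodes in the picture.
__global__ void A() {} __global__ void B() {}
__global__ void C() {} __global__ void D() {}

void buildGraphExplicit(cudaGraphExec_t* graphExec) {
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    cudaKernelNodeParams p = {};
    p.gridDim = dim3(1); p.blockDim = dim3(1);
    p.sharedMemBytes = 0;
    p.kernelParams = nullptr;   // kernels take no arguments
    p.extra = nullptr;

    cudaGraphNode_t nA, nB, nC, nD;
    p.func = (void*)A;
    cudaGraphAddKernelNode(&nA, graph, nullptr, 0, &p);  // A: no dependencies
    p.func = (void*)B;
    cudaGraphAddKernelNode(&nB, graph, &nA, 1, &p);      // B depends on A
    p.func = (void*)C;
    cudaGraphAddKernelNode(&nC, graph, &nA, 1, &p);      // C depends on A; B and C are independent
    cudaGraphNode_t depsD[2] = {nB, nC};
    p.func = (void*)D;
    cudaGraphAddKernelNode(&nD, graph, depsD, 2, &p);    // D depends on both B and C

    cudaGraphInstantiate(graphExec, graph, 0);  // CUDA 12.x signature
    cudaGraphDestroy(graph);
}
```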
For stream capture, the situation is more complicated (in my opinion). It is possible to have cross-stream activity, but only a particular stream is designated as the capture origin.
Thanks for the explanation! As long as “concurrency is possible and provided for by graph capture”, then it is good.
My original concern is that when I execute the graph, I only need to provide one stream, and I was afraid that CUDA flattens the graph into a straight line (e.g., via a topological sort) so that the kernels fit in one stream. That would be bad.
Without getting into implementation details, concurrency is possible.
It’s trivial to demonstrate using API capture, and I have done that in the past. I haven’t tried to do it with stream capture, but my read of the programming guide suggests it is possible.