Enabling Dynamic Control Flow in CUDA Graphs with Device Graph Launch

jwitsoe · December 12, 2022, 8:51pm

Originally published at: https://developer.nvidia.com/blog/enabling-dynamic-control-flow-in-cuda-graphs-with-device-graph-launch/

CUDA device graph launch offers a performant way to enable dynamic control flow within CUDA kernels.

samrivaKS · December 16, 2022, 10:07am

Hi, it’s not clear from the article, but I suppose we cannot update the graph on the device, am I right?
I mean changing parameters such as the number of threads with cudaGraphExecKernelNodeSetParams(), or enabling/disabling kernels with cudaGraphNodeSetEnabled(), for example.

In the CUDA Toolkit documentation, “cudaGraphLaunch” is still tagged as “__host__”. Is that a mistake? It should be possible to call that function on the device now.

My last question is about cudaGraphLaunch performance: with “graph length=100”, do you mean you’re testing a graph with 100 kernels in total, or a graph with 100 sequential kernels?
I mean, if I have 10 straight lines of 10 sequential kernels each, is that a “graph length” of 100 or 10?
Thanks

sstevenson · December 19, 2022, 6:07pm

You cannot update the graph from the device, that is correct. Updates can only be performed from the host (this applies to both parameter updates and node enable/disable), and the graph must also be re-uploaded for the changes to take effect in subsequent device launches. We are thinking of adding device-side update functionality in a future release, though.
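As a rough illustration of the host-side update followed by a re-upload, here is a minimal sketch. The kernel `scale` and the helpers `resizeAndReupload` and `disableAndReupload` are hypothetical names made up for this example; only the `cudaGraphExecKernelNodeSetParams`, `cudaGraphNodeSetEnabled`, and `cudaGraphUpload` calls are real runtime API. It assumes `exec` was instantiated with `cudaGraphInstantiateFlagDeviceLaunch` and that `node` is a kernel node running `scale`.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel used by the node being updated.
__global__ void scale(float *data, float factor) {
    data[threadIdx.x] *= factor;
}

// Host side: change the block size and arguments of one kernel node in the
// executable graph, then re-upload so later device launches see the change.
cudaError_t resizeAndReupload(cudaGraphExec_t exec, cudaGraphNode_t node,
                              float *data, float factor,
                              unsigned int threads, cudaStream_t stream) {
    void *args[] = { &data, &factor };
    cudaKernelNodeParams p = {};
    p.func = (void *)scale;
    p.gridDim = dim3(1);
    p.blockDim = dim3(threads);   // the "number of threads" being changed
    p.kernelParams = args;

    cudaError_t err = cudaGraphExecKernelNodeSetParams(exec, node, &p);
    if (err != cudaSuccess) return err;
    return cudaGraphUpload(exec, stream);  // re-upload after the update
}

// Host side: disable one node, then re-upload in the same way.
cudaError_t disableAndReupload(cudaGraphExec_t exec, cudaGraphNode_t node,
                               cudaStream_t stream) {
    cudaError_t err = cudaGraphNodeSetEnabled(exec, node, 0);
    if (err != cudaSuccess) return err;
    return cudaGraphUpload(exec, stream);
}
```

The upload is stream-ordered, so the launch that triggers the next device-side launch would normally go into the same stream, or after a synchronization on it.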

Yes, cudaGraphLaunch can be called from both the host and device. That does indeed appear to be a documentation error, thanks for bringing it to our attention! We’ll work on getting that fixed.
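For reference, a minimal sketch of what a device-side call looks like, modeled on the pattern in the Programming Guide. The kernels `work` and `launcher` and the function `runFireAndForget` are hypothetical names for this example, and error checking is omitted for brevity. It assumes a CUDA 12 toolkit; I believe device-side `cudaGraphLaunch` also needs relocatable device code (`nvcc -rdc=true`), but treat that as an assumption to verify.

```cuda
#include <cuda_runtime.h>

// Hypothetical worker kernel; it is the only node of the device graph.
__global__ void work(int *counter) { atomicAdd(counter, 1); }

// Launcher kernel: a single thread launches the uploaded graph from the device.
__global__ void launcher(cudaGraphExec_t deviceGraph) {
    cudaGraphLaunch(deviceGraph, cudaStreamGraphFireAndForget);
}

// Runs the launching graph once and returns the counter value read back.
int runFireAndForget() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    int *counter;
    cudaMalloc(&counter, sizeof(int));
    cudaMemset(counter, 0, sizeof(int));

    // Device graph: capture, instantiate for device launch, and upload.
    cudaGraph_t g2;
    cudaGraphExec_t exec2;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    work<<<1, 1, 0, stream>>>(counter);
    cudaStreamEndCapture(stream, &g2);
    cudaGraphInstantiate(&exec2, g2, cudaGraphInstantiateFlagDeviceLaunch);
    cudaGraphUpload(exec2, stream);

    // Launching graph: the launcher kernel is itself captured into a graph.
    cudaGraph_t g1;
    cudaGraphExec_t exec1;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    launcher<<<1, 1, 0, stream>>>(exec2);
    cudaStreamEndCapture(stream, &g1);
    cudaGraphInstantiate(&exec1, g1, cudaGraphInstantiateFlagDeviceLaunch);

    // Host launch of g1, which in turn launches g2 from the device.
    cudaGraphLaunch(exec1, stream);
    cudaStreamSynchronize(stream);

    int result = 0;
    cudaMemcpy(&result, counter, sizeof(int), cudaMemcpyDeviceToHost);

    cudaGraphExecDestroy(exec1);
    cudaGraphExecDestroy(exec2);
    cudaGraphDestroy(g1);
    cudaGraphDestroy(g2);
    cudaFree(counter);
    cudaStreamDestroy(stream);
    return result;
}
```

On the device, the second argument is one of the named launch streams (here `cudaStreamGraphFireAndForget`) rather than a regular stream handle.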

The length is the sequential length, not the general size. For the single-entry parallel straight-line and straight-line graphs, that means that the straight-line sections are 100 sequential kernels each; and for the fork join case, it is 100 sequential iterations of forking out into 2 nodes and then joining back into a single node.
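To make “sequential length” concrete, here is a sketch of a straight-line graph built with explicit dependencies. This is not the benchmark code from the article; `step` and `buildStraightLine` are hypothetical names for illustration. By the definition above, 10 parallel straight lines of 10 kernels each would presumably count as length 10, not 100.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Placeholder kernel standing in for one step of the chain.
__global__ void step(int) {}

// Builds a straight-line graph of `length` kernel nodes, where each node
// depends on the one before it. `length` is the sequential length.
cudaError_t buildStraightLine(cudaGraph_t *graph, size_t length) {
    cudaError_t err = cudaGraphCreate(graph, 0);
    if (err != cudaSuccess) return err;

    int arg = 0;
    void *args[] = { &arg };
    cudaKernelNodeParams p = {};
    p.func = (void *)step;
    p.gridDim = dim3(1);
    p.blockDim = dim3(1);
    p.kernelParams = args;

    cudaGraphNode_t prev = nullptr;
    for (size_t i = 0; i < length; ++i) {
        cudaGraphNode_t node;
        // The first node has no dependency; every later node depends on `prev`.
        err = cudaGraphAddKernelNode(&node, *graph,
                                     i ? &prev : nullptr, i ? 1 : 0, &p);
        if (err != cudaSuccess) return err;
        prev = node;
    }
    return cudaSuccess;
}
```

Calling `buildStraightLine(&g, 100)` would give the length-100 straight-line shape; a fork-join graph of the same length would have more than 100 nodes in total.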

I hope that answers all your questions.

sz1321 · April 12, 2024, 1:53am

Hi, I may have found a mistake here (CUDA C++ Programming Guide, section 3.2.8.7.7.1.3):

According to the code provided, G1 should call tailLaunch(G2) instead of tailLaunch(G1).
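To spell out the intended behavior, here is a small sketch of G1 tail-launching G2, under my reading of the guide that a tail-launched graph starts only after the launching graph has completed. The kernels `mark` and `launchTail` and the function `runTailLaunch` are hypothetical names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Records `id` in the next free slot so the execution order can be inspected.
__global__ void mark(int *order, int *next, int id) {
    order[atomicAdd(next, 1)] = id;
}

// Kernel inside G1: queues G2 as a tail launch.
__global__ void launchTail(cudaGraphExec_t g2) {
    cudaGraphLaunch(g2, cudaStreamGraphTailLaunch);
}

// Runs G1 once and copies the recorded order (two entries) into `order`.
void runTailLaunch(int order[2]) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    int *buf;  // buf[0..1] hold the order, buf[2] is the slot counter
    cudaMalloc(&buf, 3 * sizeof(int));
    cudaMemset(buf, 0, 3 * sizeof(int));

    // G2: a single kernel that records id 2.
    cudaGraph_t g2;
    cudaGraphExec_t exec2;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    mark<<<1, 1, 0, stream>>>(buf, buf + 2, 2);
    cudaStreamEndCapture(stream, &g2);
    cudaGraphInstantiate(&exec2, g2, cudaGraphInstantiateFlagDeviceLaunch);
    cudaGraphUpload(exec2, stream);

    // G1: tail-launch G2 first, then record id 1.
    cudaGraph_t g1;
    cudaGraphExec_t exec1;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    launchTail<<<1, 1, 0, stream>>>(exec2);   // i.e. tailLaunch(G2)
    mark<<<1, 1, 0, stream>>>(buf, buf + 2, 1);
    cudaStreamEndCapture(stream, &g1);
    cudaGraphInstantiate(&exec1, g1, cudaGraphInstantiateFlagDeviceLaunch);

    cudaGraphLaunch(exec1, stream);
    cudaStreamSynchronize(stream);
    cudaMemcpy(order, buf, 2 * sizeof(int), cudaMemcpyDeviceToHost);

    cudaGraphExecDestroy(exec1);
    cudaGraphExecDestroy(exec2);
    cudaGraphDestroy(g1);
    cudaGraphDestroy(g2);
    cudaFree(buf);
    cudaStreamDestroy(stream);
}
```

If that reading is right, id 1 should be recorded before id 2 even though the tail launch is issued first, because G2 waits for all of G1 to finish.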

Here is the link to the webpage:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-graph-launch