Getting Started with CUDA Graphs

Originally published at:

The performance of GPU architectures continues to increase with every new generation. Modern GPUs are so fast that, in many cases of interest, the time taken by each GPU operation (e.g. kernel or memory copy) is now measured in microseconds. However, there are overheads associated with the submission of each operation to the GPU –…

Hi Alan,

Interesting post. I tried using the manual mode with some applications, and noticed that the graph can also exploit concurrency across streams for the kernel nodes automatically. This is an interesting feature that applies beyond short-runtime kernels, since I don't need the partitioning work anymore. I am wondering what other optimizations I can expect from such a graph implementation?

That’s a very good point. To achieve optimal performance for an application, you need to expose the parallelism inherent to that application as fully as possible. When using the manual method to create a CUDA graph, you do this by explicitly specifying dependencies, and the graph will be built with as many parallel branches as possible given these dependencies. When capturing a graph from CUDA streams, the parallelism will be the same as that of your original stream-based code. So if your stream-based code was already fully exposing the available parallelism, the graph would be exactly the same and there would be no benefit from building it manually. But in many cases, the manual approach may end up exposing extra available parallelism (as you found), possibly at the expense of more effort and code disruption (depending on the application). The best approach should be decided on a case-by-case basis, given the costs and benefits for that specific case.
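To make the manual method concrete, here is a minimal sketch. The kernels, sizes, and device pointers are illustrative assumptions, not from the post: nodes A and B are added with no dependencies, so the graph is free to run them concurrently, while node C explicitly depends on both.

```cuda
#include <cuda_runtime.h>

// Placeholder kernels for illustration only.
__global__ void kernelA(float *x) { x[threadIdx.x] += 1.0f; }
__global__ void kernelB(float *y) { y[threadIdx.x] += 2.0f; }
__global__ void kernelC(float *x, float *y) { x[threadIdx.x] += y[threadIdx.x]; }

int main() {
    float *dX, *dY;
    cudaMalloc(&dX, 256 * sizeof(float));
    cudaMalloc(&dY, 256 * sizeof(float));

    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    // Describe kernel A as a graph node.
    cudaKernelNodeParams pA = {};
    void *argsA[] = {&dX};
    pA.func = (void *)kernelA;
    pA.gridDim = dim3(1);
    pA.blockDim = dim3(256);
    pA.kernelParams = argsA;

    cudaKernelNodeParams pB = pA;
    void *argsB[] = {&dY};
    pB.func = (void *)kernelB;
    pB.kernelParams = argsB;

    cudaKernelNodeParams pC = pA;
    void *argsC[] = {&dX, &dY};
    pC.func = (void *)kernelC;
    pC.kernelParams = argsC;

    // A and B have no dependencies: the graph may run them concurrently.
    cudaGraphNode_t nodeA, nodeB, nodeC;
    cudaGraphAddKernelNode(&nodeA, graph, nullptr, 0, &pA);
    cudaGraphAddKernelNode(&nodeB, graph, nullptr, 0, &pB);

    // C depends on both A and B, so it runs only after they complete.
    cudaGraphNode_t deps[] = {nodeA, nodeB};
    cudaGraphAddKernelNode(&nodeC, graph, deps, 2, &pC);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    cudaGraphLaunch(exec, 0);  // launch in the default stream
    cudaStreamSynchronize(0);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaFree(dX);
    cudaFree(dY);
    return 0;
}
```

The key point is that the dependency array passed to `cudaGraphAddKernelNode` is the only ordering constraint, so any nodes not linked by dependencies are candidates for concurrent execution.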

What if I need to modify a kernel's parameters before calling another kernel? And what if I need to call cudaDeviceSynchronize before executing another child graph?

Hi Alan, I can't see the benefit in your example. As I've understood it, the purpose of CUDA graphs is to implement a "circuit" of kernels as an alternative to dynamic parallelism. The source of the simpleCUDAGraphs sample makes this much clearer, but I still have not found a sufficiently instructive example. Could you please post a simple example of how to implement a graph with different kernels, having graphs as nodes as well as kernels? Thanks.

Hi Alan, is the graph executor thread-safe? Can I have a centralized executor with multiple threads submitting graphs at the same time? I know the graph itself is not thread-safe.

Can CUDA stream capture build a graph that includes an OptiX 7 optixLaunch() call? OptiX 7 is CUDA compatible, but launches its own kernels in a user-selected stream.

Pat, just wanted to let you know that we're working on an answer. We'll get back to you soon.

In general, there is scope to apply CUDA graphs to any CUDA compatible API, but doing so relies on the internal functionality of that API only performing activities that are supported by graphs. We are not aware of anyone else having tried this combination so far, so we had to investigate. Unfortunately it looks like OptiX is not currently capturable into a graph. When OptiX launches work, it adds incoming/outgoing events around the work items which are not yet supported by graphs, and this type of “eager” resource assignment needs some rework to be made fully asynchronous. But this question has highlighted to us that we need to bring the different teams together to make this happen. So many thanks for bringing up this issue, and we hope to support interoperability between graphs and OptiX in a future release.
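For readers following along, the capture path being discussed looks like the sketch below. `libraryLaunch` is a hypothetical stand-in for a third-party call (such as an OptiX-style launch) that enqueues its own work into the stream; capture only succeeds if everything the library does internally is capture-compatible.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for a third-party API call that issues
// its own kernels into the given stream.
void libraryLaunch(cudaStream_t stream);

void captureLibraryWork(cudaStream_t stream) {
    cudaGraph_t graph;

    // Work issued into `stream` between Begin/EndCapture is recorded
    // into a graph instead of being executed immediately.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    libraryLaunch(stream);
    cudaError_t err = cudaStreamEndCapture(stream, &graph);

    // If the library performed an operation that graphs do not yet
    // support (as with OptiX's event usage described above), the
    // capture is invalidated and an error is returned here.
    if (err != cudaSuccess) {
        // handle the unsupported-capture case
    }
}
```

This is why capturability depends entirely on the library's internals: the application code around the capture is identical either way.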

Is it possible to run the same graph on multiple devices? Will child graph nodes always run in the same stream?