Getting Started with CUDA Graphs

Originally published at: https://developer.nvidia.com/blog/cuda-graphs/

The performance of GPU architectures continues to increase with every new generation. Modern GPUs are so fast that, in many cases of interest, the time taken by each GPU operation (e.g. kernel or memory copy) is now measured in microseconds. However, there are overheads associated with the submission of each operation to the GPU –…

Hi Alan,

Interesting post. I tried using the manual mode with some applications and noticed that the graph can also exploit concurrent streams for the kernel nodes automatically. This is an interesting feature that applies beyond short-runtime kernels, since I no longer need to do the stream-partitioning work myself. I am wondering what other optimizations I can expect from such a graph implementation?

That’s a very good point. To achieve optimal performance, you need to expose the parallelism inherent in your application as fully as possible. When using the manual method to create a CUDA graph, you do this by explicitly specifying dependencies, and the graph is built with as many parallel branches as those dependencies allow. When capturing a graph from CUDA streams, the parallelism is the same as that of your original stream-based code. So if your stream-based code was already fully exposing the available parallelism, the captured graph would be exactly the same, and there would be no benefit to building it manually. But in many cases, the manual approach can expose extra parallelism (as you found), possibly at the expense of more effort and code disruption (depending on the application). The best approach should be decided case by case, weighing the costs and benefits for that specific application.
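To illustrate the manual method described above, here is a minimal sketch of building a graph with explicit dependencies. The kernels `kernelA`, `kernelB`, and `kernelC` are hypothetical; error checking is omitted. Because `kernelB` and `kernelC` each depend only on `kernelA`, the graph is free to schedule them concurrently, without any explicit stream partitioning.

```cpp
#include <cuda_runtime.h>

__global__ void kernelA(float *d) { d[threadIdx.x] += 1.0f; }
__global__ void kernelB(float *d) { d[threadIdx.x] *= 2.0f; }
__global__ void kernelC(float *d) { d[threadIdx.x] -= 3.0f; }

void buildGraph(float *d_data, cudaGraphExec_t *execOut) {
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    cudaKernelNodeParams p = {};
    void *args[] = { &d_data };
    p.gridDim      = dim3(1);
    p.blockDim     = dim3(256);
    p.kernelParams = args;

    cudaGraphNode_t a, b, c;
    p.func = (void *)kernelA;
    cudaGraphAddKernelNode(&a, graph, nullptr, 0, &p);  // no dependencies
    p.func = (void *)kernelB;
    cudaGraphAddKernelNode(&b, graph, &a, 1, &p);       // depends on A only
    p.func = (void *)kernelC;
    cudaGraphAddKernelNode(&c, graph, &a, 1, &p);       // depends on A only
    // B and C share no edge, so the scheduler may run them concurrently.

    cudaGraphInstantiate(execOut, graph, nullptr, nullptr, 0);
    cudaGraphDestroy(graph);  // the executable graph keeps its own copy
}
```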

What if I need to modify a kernel's parameters before launching another kernel? What if I need to call cudaDeviceSynchronize before executing another child graph?
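On the first point, kernel arguments in an already-instantiated graph can be updated in place with cudaGraphExecKernelNodeSetParams, so the graph does not need to be rebuilt between launches. A hedged sketch, assuming `graphExec` was instantiated from a graph containing kernel node `node`, and where `myKernel`, `d_buf`, and `stream` are hypothetical names. (Note that cudaDeviceSynchronize cannot appear inside a graph; ordering relative to a child graph is expressed as a dependency edge on the child-graph node instead.)

```cpp
// Update the arguments of one kernel node in an instantiated graph.
cudaKernelNodeParams p = {};
void *args[] = { &d_buf };        // d_buf: hypothetical device pointer
p.func         = (void *)myKernel;
p.gridDim      = dim3(1);
p.blockDim     = dim3(256);
p.kernelParams = args;

// Patch the executable graph in place; no re-instantiation needed.
cudaGraphExecKernelNodeSetParams(graphExec, node, &p);

// Relaunch with the updated parameters.
cudaGraphLaunch(graphExec, stream);
```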

Hi Alan, I can't see the benefit in your example. As I've understood it, the purpose of CUDA Graphs is to implement a "circuit" of kernels as an alternative to dynamic parallelism. The source of the simpleCUDAGraphs sample is much clearer, but I still have not found a sufficiently instructive example. Could you please post a simple example of how to implement a graph with different kernels, and with graphs as nodes as well as kernels? Thanks.
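For reference, a whole graph can be embedded as a single node of another graph via cudaGraphAddChildGraphNode. A minimal sketch, assuming `childGraph` has been built separately and that `kernelX`, `d_data`, and `stream` are hypothetical; error checking is omitted:

```cpp
// A graph whose nodes are one kernel and one child graph.
cudaGraph_t graph;
cudaGraphCreate(&graph, 0);

cudaKernelNodeParams p = {};
void *args[] = { &d_data };       // d_data: hypothetical device pointer
p.func         = (void *)kernelX;
p.gridDim      = dim3(1);
p.blockDim     = dim3(256);
p.kernelParams = args;

cudaGraphNode_t kNode, childNode;
cudaGraphAddKernelNode(&kNode, graph, nullptr, 0, &p);
// The child graph becomes a single node that runs after kNode completes.
cudaGraphAddChildGraphNode(&childNode, graph, &kNode, 1, childGraph);

cudaGraphExec_t exec;
cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
cudaGraphLaunch(exec, stream);
```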

Hi Alan, is the graph executor thread-safe? Can I have a centralized executor with multiple threads submitting graphs at the same time? I know the graph itself is not thread-safe.

Can CUDA stream record and capture build a graph that includes an OptiX 7 optixLaunch() call? OptiX 7 is CUDA compatible, but it launches its own kernels in a user-selected stream.

Pat, just wanted to let you know that we're working on an answer. We'll get back to you soon.

In general, there is scope to apply CUDA graphs to any CUDA-compatible API, but doing so relies on the internal functionality of that API only performing activities that are supported by graphs. We are not aware of anyone else having tried this combination so far, so we had to investigate. Unfortunately, it looks like OptiX is not currently capturable into a graph. When OptiX launches work, it adds incoming/outgoing events around the work items, which are not yet supported by graphs, and this type of “eager” resource assignment needs some rework to be made fully asynchronous. But this question has highlighted to us that we need to bring the different teams together to make this happen. So many thanks for bringing up this issue, and we hope to support interoperability between graphs and OptiX in a future release.
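The general pattern for capturing library work into a graph looks like the sketch below. It only succeeds if everything the library enqueues on the captured stream is graph-capturable, which is exactly the condition OptiX currently fails. Here `libraryEnqueue` stands in for a hypothetical capture-compatible library call, and `stream` is an existing CUDA stream.

```cpp
// Capture work enqueued on `stream` by a library into a graph.
cudaGraph_t graph;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
libraryEnqueue(stream);                 // all work must target `stream`
cudaStreamEndCapture(stream, &graph);   // fails if unsupported ops were issued

cudaGraphExec_t exec;
cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
cudaGraphLaunch(exec, stream);          // replay the captured work
```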


Is it possible to run the same graph on multiple devices? Will child graph nodes always run in the same stream?

Hi Alan, I want to know whether a CUDA graph can be captured during an asynchronous copy between host and device memory, and also when we use context->enqueuev2().
The above query is related to the TensorRT sample code from the file
/usr/src/tensorrt/sampleINT8API/sampleINT8.cpp, and the method/API is infer().
I am pasting the code for your reference. Please let me know whether it can be modified to implement the CUDA graph. I am pasting the error below, and the code which I modified for the CUDA graph implementation is attached to this post. The code under the macro TRT_DEBUG is what I have added; the rest is as-is from /usr/src/tensorrt/sampleINT8API/sampleINT8.cpp, which is downloaded from the NVIDIA site.

Error message

[01/07/2024-03:22:44] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 1373 MiB, GPU 7177 MiB
[01/07/2024-03:22:44] [I] Started capturing CUDA graph

[01/07/2024-03:22:44] [E] [TRT] 1: [blobInfo.cpp::getHostScale::803] Error Code 1: Cuda Runtime (operation not permitted when stream is capturing)
[01/07/2024-03:22:44] [F] [TRT] [defaultAllocator.cpp::free::85] Error Code 1: Cuda Runtime (operation not permitted when stream is capturing)
[01/07/2024-03:22:44] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1372, GPU 7177 (MiB)
[01/07/2024-03:22:44] [F] [TRT] [defaultAllocator.cpp::free::85] Error Code 1: Cuda Runtime (operation not permitted when stream is capturing)

I have modified only one API: sample::Logger::TestResult SampleINT8API::infer()

code_modification_cuda_graph_capture.txt (2.3 KB)

Hi Alan, I have fixed this issue. We need to call context->enqueueV2() once before starting to capture the CUDA graph.
I referred to this NVIDIA GitHub link, implemented its suggestions, and it worked.
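For anyone hitting the same error, the fix can be sketched as follows: run enqueueV2 once outside capture so TensorRT finishes its lazy allocations (the "operation not permitted when stream is capturing" failures), then capture the steady-state launch. Here `context`, `bindings`, and `stream` stand for the existing TensorRT execution context, binding array, and CUDA stream; error checking is omitted.

```cpp
// Warm-up launch: runs normally and is NOT captured. This lets TensorRT
// perform one-time allocations that are illegal during stream capture.
context->enqueueV2(bindings, stream, nullptr);
cudaStreamSynchronize(stream);

// Now capture the steady-state inference launch into a graph.
cudaGraph_t graph;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
context->enqueueV2(bindings, stream, nullptr);   // captured launch
cudaStreamEndCapture(stream, &graph);

cudaGraphExec_t graphExec;
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

// Replay the graph for each subsequent inference.
cudaGraphLaunch(graphExec, stream);
cudaStreamSynchronize(stream);
```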

Thanks and Regards

Nagaraj Trivedi