Getting Started with CUDA Graphs

Originally published at: https://developer.nvidia.com/blog/cuda-graphs/

The performance of GPU architectures continues to increase with every new generation. Modern GPUs are so fast that, in many cases of interest, the time taken by each GPU operation (e.g. kernel or memory copy) is now measured in microseconds. However, there are overheads associated with the submission of each operation to the GPU –…

Hi Alan,

Interesting post. I tried using the manual mode with some applications and noticed that the graph can also exploit concurrent streams for the kernel nodes automatically. This is an interesting feature that applies beyond short-runtime kernels, since I no longer need to do the stream-partitioning work myself. I am wondering what other optimizations I can expect from such a graph implementation?

That’s a very good point. To achieve optimal performance, you need to expose the parallelism inherent in your application as fully as possible. When using the manual method to create a CUDA graph, you do this by explicitly specifying dependencies, and the graph is built with as many parallel branches as those dependencies allow. When capturing a graph from CUDA streams, the parallelism is the same as that of your original stream-based code. So if your stream-based code was already fully exposing the available parallelism, the captured graph would be exactly the same, and there would be no benefit to building it manually. But in many cases, the manual approach can expose extra parallelism (as you found), possibly at the expense of more effort and code disruption (depending on the application). The best approach should be decided case by case, weighing the costs and benefits for that specific application.
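To illustrate the manual method described above, here is a minimal sketch of building a graph with explicit dependencies. The kernels `kernelA`, `kernelB`, and `kernelC` are hypothetical; error checking is omitted. Because `kernelB` and `kernelC` each depend only on `kernelA`, the graph is free to schedule them concurrently, without any explicit stream partitioning.

```cpp
#include <cuda_runtime.h>

__global__ void kernelA(float *d) { d[threadIdx.x] += 1.0f; }
__global__ void kernelB(float *d) { d[threadIdx.x] *= 2.0f; }
__global__ void kernelC(float *d) { d[threadIdx.x] -= 3.0f; }

void buildGraph(float *d_data, cudaGraphExec_t *execOut) {
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    cudaKernelNodeParams p = {};
    void *args[] = { &d_data };
    p.gridDim      = dim3(1);
    p.blockDim     = dim3(256);
    p.kernelParams = args;

    cudaGraphNode_t a, b, c;
    p.func = (void *)kernelA;
    cudaGraphAddKernelNode(&a, graph, nullptr, 0, &p);  // no dependencies
    p.func = (void *)kernelB;
    cudaGraphAddKernelNode(&b, graph, &a, 1, &p);       // depends on A only
    p.func = (void *)kernelC;
    cudaGraphAddKernelNode(&c, graph, &a, 1, &p);       // depends on A only
    // B and C share no edge, so the scheduler may run them concurrently.

    cudaGraphInstantiate(execOut, graph, nullptr, nullptr, 0);
    cudaGraphDestroy(graph);  // the executable graph keeps its own copy
}
```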

What if I need to modify a kernel's parameters before launching another kernel? What if I need to call cudaDeviceSynchronize before executing another child graph?
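On the first point, kernel arguments in an already-instantiated graph can be updated in place with cudaGraphExecKernelNodeSetParams, so the graph does not need to be rebuilt between launches. A hedged sketch, assuming `graphExec` was instantiated from a graph containing kernel node `node`, and where `myKernel`, `d_buf`, and `stream` are hypothetical names. (Note that cudaDeviceSynchronize cannot appear inside a graph; ordering relative to a child graph is expressed as a dependency edge on the child-graph node instead.)

```cpp
// Update the arguments of one kernel node in an instantiated graph.
cudaKernelNodeParams p = {};
void *args[] = { &d_buf };        // d_buf: hypothetical device pointer
p.func         = (void *)myKernel;
p.gridDim      = dim3(1);
p.blockDim     = dim3(256);
p.kernelParams = args;

// Patch the executable graph in place; no re-instantiation needed.
cudaGraphExecKernelNodeSetParams(graphExec, node, &p);

// Relaunch with the updated parameters.
cudaGraphLaunch(graphExec, stream);
```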

Hi Alan, I can't see the benefit in your example. As I've understood it, the purpose of CUDA Graphs is to implement a "circuit" of kernels as an alternative to dynamic parallelism. The source of the simpleCUDAGraphs sample is much clearer, but I still have not found a sufficiently instructive example. Could you please post a simple example of how to implement a graph with different kernels, and with graphs as nodes as well as kernels? Thanks.
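For reference, a whole graph can be embedded as a single node of another graph via cudaGraphAddChildGraphNode. A minimal sketch, assuming `childGraph` has been built separately and that `kernelX`, `d_data`, and `stream` are hypothetical; error checking is omitted:

```cpp
// A graph whose nodes are one kernel and one child graph.
cudaGraph_t graph;
cudaGraphCreate(&graph, 0);

cudaKernelNodeParams p = {};
void *args[] = { &d_data };       // d_data: hypothetical device pointer
p.func         = (void *)kernelX;
p.gridDim      = dim3(1);
p.blockDim     = dim3(256);
p.kernelParams = args;

cudaGraphNode_t kNode, childNode;
cudaGraphAddKernelNode(&kNode, graph, nullptr, 0, &p);
// The child graph becomes a single node that runs after kNode completes.
cudaGraphAddChildGraphNode(&childNode, graph, &kNode, 1, childGraph);

cudaGraphExec_t exec;
cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
cudaGraphLaunch(exec, stream);
```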

Hi Alan, is the graph executor thread-safe? Can I have a centralized executor with multiple threads submitting graphs at the same time? I know the graph itself is not thread-safe.

Can CUDA stream record and capture build a graph that includes an OptiX 7 optixLaunch() call? OptiX 7 is CUDA compatible, but it launches its own kernels in a user-selected stream.

Pat, just wanted to let you know that we're working on an answer. We'll get back to you soon.

In general, there is scope to apply CUDA graphs to any CUDA-compatible API, but doing so relies on the internal functionality of that API only performing activities that are supported by graphs. We are not aware of anyone else having tried this combination so far, so we had to investigate. Unfortunately, it looks like OptiX is not currently capturable into a graph. When OptiX launches work, it adds incoming/outgoing events around the work items, which are not yet supported by graphs, and this type of “eager” resource assignment needs some rework to be made fully asynchronous. But this question has highlighted to us that we need to bring the different teams together to make this happen. So many thanks for bringing up this issue, and we hope to support interoperability between graphs and OptiX in a future release.
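The general pattern for capturing library work into a graph looks like the sketch below. It only succeeds if everything the library enqueues on the captured stream is graph-capturable, which is exactly the condition OptiX currently fails. Here `libraryEnqueue` stands in for a hypothetical capture-compatible library call, and `stream` is an existing CUDA stream.

```cpp
// Capture work enqueued on `stream` by a library into a graph.
cudaGraph_t graph;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
libraryEnqueue(stream);                 // all work must target `stream`
cudaStreamEndCapture(stream, &graph);   // fails if unsupported ops were issued

cudaGraphExec_t exec;
cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
cudaGraphLaunch(exec, stream);          // replay the captured work
```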


Is it possible to run the same graph on multiple devices? Will child graph nodes always run in the same stream?

Hi Alan, I want to know whether a CUDA graph can be captured during an asynchronous copy between host and device memory, and also when we use context->enqueuev2().
The above query is related to the TensorRT sample code from the file
/usr/src/tensorrt/sampleINT8API/sampleINT8.cpp, and the method/API is infer().
I am pasting the code for your reference. Please let me know whether it can be modified to implement the CUDA graph. I am pasting the error below, and the code which I modified for the CUDA graph implementation is attached to this post. The code under the macro TRT_DEBUG is what I have added; the rest is as-is from /usr/src/tensorrt/sampleINT8API/sampleINT8.cpp, which is downloaded from the NVIDIA site.

Error message

[01/07/2024-03:22:44] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 1373 MiB, GPU 7177 MiB
[01/07/2024-03:22:44] [I] Started capturing CUDA graph

[01/07/2024-03:22:44] [E] [TRT] 1: [blobInfo.cpp::getHostScale::803] Error Code 1: Cuda Runtime (operation not permitted when stream is capturing)
[01/07/2024-03:22:44] [F] [TRT] [defaultAllocator.cpp::free::85] Error Code 1: Cuda Runtime (operation not permitted when stream is capturing)
[01/07/2024-03:22:44] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1372, GPU 7177 (MiB)
[01/07/2024-03:22:44] [F] [TRT] [defaultAllocator.cpp::free::85] Error Code 1: Cuda Runtime (operation not permitted when stream is capturing)

I have modified only one API: sample::Logger::TestResult SampleINT8API::infer()

code_modification_cuda_graph_capture.txt (2.3 KB)

Hi Alan, I have fixed this issue. We need to call context->enqueueV2() once before starting to capture the CUDA graph.
I referred to this NVIDIA GitHub link, implemented its suggestions, and it worked.
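For anyone hitting the same error, the fix can be sketched as follows: run enqueueV2 once outside capture so TensorRT finishes its lazy allocations (the "operation not permitted when stream is capturing" failures), then capture the steady-state launch. Here `context`, `bindings`, and `stream` stand for the existing TensorRT execution context, binding array, and CUDA stream; error checking is omitted.

```cpp
// Warm-up launch: runs normally and is NOT captured. This lets TensorRT
// perform one-time allocations that are illegal during stream capture.
context->enqueueV2(bindings, stream, nullptr);
cudaStreamSynchronize(stream);

// Now capture the steady-state inference launch into a graph.
cudaGraph_t graph;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
context->enqueueV2(bindings, stream, nullptr);   // captured launch
cudaStreamEndCapture(stream, &graph);

cudaGraphExec_t graphExec;
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

// Replay the graph for each subsequent inference.
cudaGraphLaunch(graphExec, stream);
cudaStreamSynchronize(stream);
```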

Thanks and Regards

Nagaraj Trivedi