CUDA Graph and TensorRT batch inference

I used Nsight Systems to visualize a TensorRT batch inference (ExecutionContext::execute).
I saw the kernel launches and the kernel executions for one batch inference.
Now I would like to launch all of these kernels in a single operation by using a CUDA graph.
I read about stream capture mode (cudaStreamBeginCapture) and this tutorial :

I ended up with the following code:

   cudaGraph_t graph;
   cudaGraphExec_t instance;

   buffers.copyInputToDevice();
    
   cudaStreamBeginCapture(0, cudaStreamCaptureModeGlobal);
   context->executeV2(buffers.getDeviceBindings().data());
   cudaStreamEndCapture(0, &graph);

   cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);

   cudaGraphLaunch(instance, 0);

   buffers.copyOutputToHost();

NB: I performed the stream capture on the default stream, because that is where the batch inference runs.

The problem:
cudaGraphLaunch is visible in the CUDA API row in Nsight Systems, but it is not followed by any kernel executions…

Is there an obvious reason why this does not work?
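One quick way to narrow this down is to check the return codes of the CUDA calls in the capture sequence; on an unsupported stream, cudaStreamBeginCapture may fail (e.g. with cudaErrorStreamCaptureUnsupported), and the error then goes unnoticed if return values are ignored. A minimal sketch of such a check (the CUDA_CHECK macro here is illustrative, not from the original code):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime_api.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "%s failed: %s (%s:%d)\n", #call,         \
                         cudaGetErrorName(err_), __FILE__, __LINE__);      \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

// Wrapped around the sequence from the question, this would report
// exactly which step fails:
// CUDA_CHECK(cudaStreamBeginCapture(0, cudaStreamCaptureModeGlobal));
// context->executeV2(buffers.getDeviceBindings().data());
// CUDA_CHECK(cudaStreamEndCapture(0, &graph));
// CUDA_CHECK(cudaGraphInstantiate(&instance, graph, NULL, NULL, 0));
// CUDA_CHECK(cudaGraphLaunch(instance, 0));
```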

I read this :
Calling [enqueueV2()] with a stream in CUDA graph capture mode has a known issue. If dynamic shapes are used, the first [enqueueV2()] call after a [setInputShapeBinding()] call will cause failure in stream capture due to resource allocation. Please call [enqueueV2()] once before capturing the graph.
But my model does not use dynamic shapes, and I used the synchronous executeV2 function…

Environment :

TensorRT Version: 7.2
CUDA Version: 11.2
cuDNN Version: 11.2

Hi @juliefraysse,

Stream capture might not work on the default stream:
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1g793d7d4e474388ddfda531603dc34aa3

Capture may not be initiated if stream is cudaStreamLegacy

See CUDA Runtime API :: CUDA Toolkit Documentation for details. It might be better to create a stream explicitly and use the asynchronous enqueueV2 instead.
See TensorRT/bert_infer.h at release/7.1 · NVIDIA/TensorRT · GitHub for an example
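A minimal sketch of that suggestion, reusing the `context` and `buffers` objects from the snippet in the question: create a non-default stream, run enqueueV2 once outside capture as a warm-up (so TensorRT performs any lazy allocation first), then capture, instantiate, and launch the graph on that same stream. Error checking is omitted for brevity.

```cpp
cudaStream_t stream;
cudaStreamCreate(&stream);  // explicit stream: capture is allowed here,
                            // unlike on the legacy default stream

// Warm-up run outside capture, so resource allocation happens now.
context->enqueueV2(buffers.getDeviceBindings().data(), stream, nullptr);
cudaStreamSynchronize(stream);

// Capture the enqueued kernels of one inference into a graph.
cudaGraph_t graph;
cudaGraphExec_t instance;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
context->enqueueV2(buffers.getDeviceBindings().data(), stream, nullptr);
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&instance, graph, nullptr, nullptr, 0);

// Replay the whole inference as a single launch.
cudaGraphLaunch(instance, stream);
cudaStreamSynchronize(stream);

// Cleanup.
cudaGraphExecDestroy(instance);
cudaGraphDestroy(graph);
cudaStreamDestroy(stream);
```

With this structure, cudaGraphLaunch should be followed by the captured kernel executions in the Nsight Systems timeline.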

Thank you.