CUDA Graph and TensorRT batch inference

I used Nsight Systems to visualize a TensorRT batch inference (ExecutionContext::execute).
I saw the kernel launches and the kernel executions for one batch inference.
Now I would like to launch all of these kernels in a single operation by using a CUDA graph.
I read about stream capture mode (cudaStreamBeginCapture) and this tutorial :

I ended up with the following code:

   cudaGraph_t graph;
   cudaGraphExec_t instance;

   buffers.copyInputToDevice();
    
   cudaStreamBeginCapture(0, cudaStreamCaptureModeGlobal);
   context->executeV2(buffers.getDeviceBindings().data());
   cudaStreamEndCapture(0, &graph);

   cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);

   cudaGraphLaunch(instance, 0);

   buffers.copyOutputToHost();

NB: I performed the stream capture on the default stream, because that is where the batch inference runs.

The problem:
cudaGraphLaunch is visible in the CUDA API row in Nsight Systems, but it is not followed by any kernel executions…

Is there an obvious reason why this does not work?
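One quick way to narrow this down is to check the return codes of the CUDA calls in the capture sequence; on an unsupported stream, cudaStreamBeginCapture may fail (e.g. with cudaErrorStreamCaptureUnsupported), and the error then goes unnoticed if return values are ignored. A minimal sketch of such a check (the CUDA_CHECK macro here is illustrative, not from the original code):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime_api.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "%s failed: %s (%s:%d)\n", #call,         \
                         cudaGetErrorName(err_), __FILE__, __LINE__);      \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

// Wrapped around the sequence from the question, this would report
// exactly which step fails:
// CUDA_CHECK(cudaStreamBeginCapture(0, cudaStreamCaptureModeGlobal));
// context->executeV2(buffers.getDeviceBindings().data());
// CUDA_CHECK(cudaStreamEndCapture(0, &graph));
// CUDA_CHECK(cudaGraphInstantiate(&instance, graph, NULL, NULL, 0));
// CUDA_CHECK(cudaGraphLaunch(instance, 0));
```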

I read this :
Calling [enqueueV2()] with a stream in CUDA graph capture mode has a known issue. If dynamic shapes are used, the first [enqueueV2()] call after a [setInputShapeBinding()] call will cause failure in stream capture due to resource allocation. Please call [enqueueV2()] once before capturing the graph.
But my model does not use dynamic shapes, and I used the synchronous executeV2 function…

Environment :

TensorRT Version: 7.2
CUDA Version: 11.2
cuDNN Version: 11.2

Hi @juliefraysse,

Stream capture might not work on the default stream:
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1g793d7d4e474388ddfda531603dc34aa3

Capture may not be initiated if stream is cudaStreamLegacy

See CUDA Runtime API :: CUDA Toolkit Documentation for details. It might be better to create a stream explicitly and use the asynchronous enqueueV2 instead.
See TensorRT/bert_infer.h at release/7.1 · NVIDIA/TensorRT · GitHub for an example
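A minimal sketch of that suggestion, reusing the `context` and `buffers` objects from the snippet in the question: create a non-default stream, run enqueueV2 once outside capture as a warm-up (so TensorRT performs any lazy allocation first), then capture, instantiate, and launch the graph on that same stream. Error checking is omitted for brevity.

```cpp
cudaStream_t stream;
cudaStreamCreate(&stream);  // explicit stream: capture is allowed here,
                            // unlike on the legacy default stream

// Warm-up run outside capture, so resource allocation happens now.
context->enqueueV2(buffers.getDeviceBindings().data(), stream, nullptr);
cudaStreamSynchronize(stream);

// Capture the enqueued kernels of one inference into a graph.
cudaGraph_t graph;
cudaGraphExec_t instance;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
context->enqueueV2(buffers.getDeviceBindings().data(), stream, nullptr);
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&instance, graph, nullptr, nullptr, 0);

// Replay the whole inference as a single launch.
cudaGraphLaunch(instance, stream);
cudaStreamSynchronize(stream);

// Cleanup.
cudaGraphExecDestroy(instance);
cudaGraphDestroy(graph);
cudaStreamDestroy(stream);
```

With this structure, cudaGraphLaunch should be followed by the captured kernel executions in the Nsight Systems timeline.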

Thank you.