I used Nsight Systems to visualize a TensorRT batch inference (ExecutionContext::execute).
I can see the kernel launches and the kernel executions for one batch inference.
Now I would like to launch all of these kernels in a single operation using a CUDA Graph.
I read about cudaStreamCapture mode and this tutorial:
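The pattern I took away from it is roughly the following (my own minimal sketch with a dummy kernel, so the names here are mine, not from the tutorial):

```cpp
// Minimal stream-capture pattern as I understand it (dummy kernel, untested sketch).
__global__ void dummyKernel(float* x) { x[threadIdx.x] += 1.0f; }

void captureExample(float* d_x) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaGraphExec_t instance;

    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    dummyKernel<<<1, 32, 0, stream>>>(d_x);  // recorded into the graph, not executed
    cudaStreamEndCapture(stream, &graph);

    cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
    cudaGraphLaunch(instance, stream);       // replays the captured work
    cudaStreamSynchronize(stream);
}
```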
I ended up with the following code:
```cpp
cudaGraph_t graph;
cudaGraphExec_t instance;

buffers.copyInputToDevice();

cudaStreamBeginCapture(0, cudaStreamCaptureModeGlobal);
context->executeV2(buffers.getDeviceBindings().data());
cudaStreamEndCapture(0, &graph);

cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
cudaGraphLaunch(instance, 0);

buffers.copyOutputToHost();
```
N.B.: I ran the stream capture on the default stream because that is where the batch inference is issued.
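For comparison, my understanding is that capture on an explicitly created (non-default) stream would look like this; a sketch only, using the asynchronous enqueueV2() so the work can be issued on that stream (variable names are mine):

```cpp
// Sketch: same capture, but on an explicitly created stream (untested assumption).
cudaStream_t stream;
cudaStreamCreate(&stream);

cudaGraph_t graph;
cudaGraphExec_t instance;

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
// enqueueV2 takes the stream explicitly, unlike the synchronous executeV2
context->enqueueV2(buffers.getDeviceBindings().data(), stream, nullptr);
cudaStreamEndCapture(stream, &graph);

cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
cudaGraphLaunch(instance, stream);
cudaStreamSynchronize(stream);  // wait for the replayed graph to finish
```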
cudaGraphLaunch is visible in the CUDA API row in Nsight Systems, but it is not followed by any kernel executions…
Is there an obvious reason why this does not work?
I read this:
Calling [enqueueV2()] with a stream in CUDA graph capture mode has a known issue. If dynamic shapes are used, the first [enqueueV2()] call after a [setInputShapeBinding()] call will cause failure in stream capture due to resource allocation. Please call [enqueueV2()] once before capturing the graph.
But my model does not use dynamic shapes, and I am using the synchronous executeV2() call rather than enqueueV2()…
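Even so, my reading of that note is that the suggested workaround would be a warm-up call before starting the capture, something like this (a sketch adapting it to a non-default stream; my assumption, not verified):

```cpp
// Warm-up call before capture, as the known-issue note suggests (untested sketch).
context->enqueueV2(buffers.getDeviceBindings().data(), stream, nullptr);
cudaStreamSynchronize(stream);  // let any lazy resource allocation finish

// Now capture should see only the steady-state kernel launches.
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
context->enqueueV2(buffers.getDeviceBindings().data(), stream, nullptr);
cudaStreamEndCapture(stream, &graph);
```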
TensorRT Version: 7.2
CUDA Version: 11.2
cuDNN Version: 11.2