I used Nsight Systems to visualize a TensorRT batch inference (ExecutionContext::execute).
I could see the kernel launches and the kernel executions for one batch inference.
Now I would like to launch all of these kernels in a single operation by using a CUDA Graph.
I read about the stream capture mode and this tutorial:
I ended up with the following code:
cudaGraph_t graph;
cudaGraphExec_t instance;

buffers.copyInputToDevice();

// Capture the kernels launched during one inference
cudaStreamBeginCapture(0, cudaStreamCaptureModeGlobal);
context->executeV2(buffers.getDeviceBindings().data());
cudaStreamEndCapture(0, &graph);

// Instantiate the captured graph and launch it
cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
cudaGraphLaunch(instance, 0);

buffers.copyOutputToHost();
NB: I ran the stream capture on the default stream because that is where the batch inference is executed.
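In case a capture error is being swallowed silently, here is the same sequence with every return code checked — a minimal sketch; the CHECK macro is a hypothetical helper, not part of my original code:

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: print the error string and abort if a CUDA call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s failed: %s\n", #call,                   \
                    cudaGetErrorString(err_));                          \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// The capture sequence from above, with each CUDA runtime call checked:
// CHECK(cudaStreamBeginCapture(0, cudaStreamCaptureModeGlobal));
// context->executeV2(buffers.getDeviceBindings().data());
// CHECK(cudaStreamEndCapture(0, &graph));
// CHECK(cudaGraphInstantiate(&instance, graph, NULL, NULL, 0));
// CHECK(cudaGraphLaunch(instance, 0));
```

If one of the capture calls is failing (for example, because of the stream used), this should print which one and why, rather than letting cudaGraphLaunch appear in the trace with nothing behind it.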
The problem:
cudaGraphLaunch is visible in the CUDA API row in Nsight, but it is not followed by any kernel execution…
Is there an obvious reason why it does not work?
I read this:
Calling [enqueueV2()] with a stream in CUDA graph capture mode has a known issue. If dynamic shapes are used, the first [enqueueV2()] call after a [setInputShapeBinding()] call will cause failure in stream capture due to resource allocation. Please call [enqueueV2()] once before capturing the graph.
But my model does not use dynamic shapes, and I used the synchronous execute function…
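For reference, the warm-up that the quoted note describes would look roughly like this (enqueueV2 is the asynchronous variant and takes an explicit stream; the stream variable and the use of a non-default stream are my assumptions, not from my original code):

```cpp
// Assumed: an explicit, non-default stream used for both warm-up and capture.
cudaStream_t stream;
cudaStreamCreate(&stream);

// Warm-up: run inference once outside of capture so TensorRT
// performs its internal resource allocation up front.
context->enqueueV2(buffers.getDeviceBindings().data(), stream, nullptr);
cudaStreamSynchronize(stream);

// Only afterwards begin the stream capture on the same stream.
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
context->enqueueV2(buffers.getDeviceBindings().data(), stream, nullptr);
cudaStreamEndCapture(stream, &graph);
```

I have not tried this variant, since the note seemed to apply only to dynamic shapes.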
Environment:
TensorRT Version: 7.2
CUDA Version: 11.2
CUDNN Version: 11.2