I am trying to use CUDA Graphs in my application, where I compute a matrix-vector product with a matrix-free method for many vectors. To test this, I tried the following code:
bool graphCreated = false;
cudaGraph_t graph;
cudaGraphExec_t instance;

if (!graphCreated)
{
    // Capture all the work my_function enqueues on stream s into a graph.
    gpuErrchk(cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal));
    for (int i = 0; i < 50000; i++)
        my_function(inputs);
    gpuErrchk(cudaStreamEndCapture(s, &graph));
    gpuErrchk(cudaGraphInstantiate(&instance, graph, NULL, NULL, 0));
    graphCreated = true;
}
// Launch the captured work once and wait for it to finish.
gpuErrchk(cudaGraphLaunch(instance, s));
gpuErrchk(cudaStreamSynchronize(s));
gpuErrchk(cudaGraphExecDestroy(instance));
gpuErrchk(cudaGraphDestroy(graph));
I had my_function print a message every 100 iterations to keep track of how many times it was called. I noticed that the time per iteration grew as the iteration count increased, which was strange since, for this test, the “inputs” parameters are identical on every call. Then, at around iteration 30,000, I got the error “allocation of new ptrcache entry failed” and the program quit.
my_function contains some explicit CUDA kernels, cuBLAS calls, and NCCL communication calls. I removed all of the cudaStreamSynchronize() calls from inside my_function.
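To give a clearer picture, here is a rough placeholder sketch of what my_function looks like (the kernel name, the Inputs fields, and the particular cuBLAS/NCCL routines are made up for illustration; the real code differs). The stream s, the cuBLAS handle, and the NCCL communicator are all created once before the capture, and everything is enqueued on s:

// Placeholder sketch only; names and sizes do not match my real code.
// Assumes s, cublas_handle, and nccl_comm were created at startup.
void my_function(const Inputs& in)
{
    static int calls = 0;
    if (++calls % 100 == 0)
        printf("my_function called %d times\n", calls);   // host-side progress print

    // Matrix-free operator application in a hand-written kernel, enqueued on s.
    apply_operator<<<in.blocks, in.threads, 0, s>>>(in.d_x, in.d_y, in.n);

    // Example cuBLAS call on the same stream.
    cublasSetStream(cublas_handle, s);
    cublasDaxpy(cublas_handle, in.n, &in.alpha, in.d_x, 1, in.d_y, 1);

    // Example NCCL call on the same stream; no cudaStreamSynchronize anywhere.
    ncclAllReduce(in.d_y, in.d_y, in.n, ncclDouble, ncclSum, nccl_comm, s);
}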
To see whether it was a memory issue, I ran the code again with much smaller input vectors, but the same thing happened. Watching nvidia-smi, I noticed a slight increase in memory usage as the iterations went on, but nothing close to the GPU's 40 GB limit. my_function does not allocate or free any memory; all of that is done before/after this piece of code.
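In case it is useful, this is roughly how I could measure the growth from inside the program instead of eyeballing nvidia-smi (just a sketch, not my actual code, with a smaller iteration count so that the capture actually completes; cudaMemGetInfo is only called outside the capture region):

size_t free_before, free_after_capture, free_after_inst, total;
gpuErrchk(cudaMemGetInfo(&free_before, &total));

gpuErrchk(cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal));
for (int i = 0; i < 10000; i++)          // smaller count so capture finishes
    my_function(inputs);
gpuErrchk(cudaStreamEndCapture(s, &graph));
gpuErrchk(cudaMemGetInfo(&free_after_capture, &total));

gpuErrchk(cudaGraphInstantiate(&instance, graph, NULL, NULL, 0));
gpuErrchk(cudaMemGetInfo(&free_after_inst, &total));

printf("capture used %zu MB, instantiation used %zu MB\n",
       (free_before - free_after_capture) >> 20,
       (free_after_capture - free_after_inst) >> 20);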