CUDA Graph error "allocation of new ptrcache entry failed"

I am trying to use CUDA Graphs in my application, where I compute a matrix-vector product with a matrix-free method for many vectors. To test this, I tried the following code:

bool graphCreated = false;
cudaGraph_t graph;
cudaGraphExec_t instance;
if (!graphCreated)
{
    gpuErrchk(cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal));
    for (int i = 0; i < 50000; i++)
        my_function(inputs);
    gpuErrchk(cudaStreamEndCapture(s, &graph));
    gpuErrchk(cudaGraphInstantiate(&instance, graph, NULL, NULL, 0));
    graphCreated = true;
}
gpuErrchk(cudaGraphLaunch(instance, s));
gpuErrchk(cudaStreamSynchronize(s));
gpuErrchk(cudaGraphExecDestroy(instance));
gpuErrchk(cudaGraphDestroy(graph));

I had “my_function” print something every 100 iterations to keep track of how many times it was called. I noticed that as the iteration number increased, the time per iteration was increasing. This was strange since for testing, the “inputs” parameters are the same each time. Then, around 30,000 iterations, I got an error “allocation of new ptrcache entry failed” and the program quit.

my_function contains some explicit CUDA kernels, cuBLAS calls, and NCCL communication calls. I removed all the cudaStreamSynchronize() calls from inside my_function.

To see if it was a memory issue, I ran the code again with much smaller input vectors. However, the same thing happened. Monitoring nvidia-smi, I noticed a slight increase in memory usage as the iterations went on, but nothing close to the 40 GB limit. my_function does not allocate or free any memory; all of that is done before and after this piece of code.

It does seem like you probably cannot do something an infinite number of times and expect graph capture to work. Yes, 50000 is not “infinite”, but the point is that there is likely to be some upper limit to what you can capture in a single graph, and you seem to have found it. The space that is running out might be host memory, or even space in a fixed-size structure.

If 50000 iterations is what you need, and you can successfully do 25,000 iterations, then perhaps you could do it twice, in two graphs, or perhaps even two launches of the same graph.
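
Here is a minimal sketch of the "several launches of the same graph" idea. It is not your code: the step kernel is a hypothetical stand-in for one iteration of my_function (no cuBLAS or NCCL), CHECK stands in for gpuErrchk, run_chunked is a name I made up, and the chunk size is arbitrary. I used cudaGraphInstantiateWithFlags (CUDA 11.4 or newer) because the five-argument cudaGraphInstantiate signature changed in CUDA 12. The graph is captured once with a modest number of iterations and then relaunched; any leftover iterations are issued directly on the stream.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message on any CUDA error (stand-in for gpuErrchk).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in for one iteration of my_function.
__global__ void step(int *counter) { *counter += 1; }

// Runs total_iters iterations by capturing iters_per_graph of them once
// and relaunching the instantiated graph. Returns the final counter value.
int run_chunked(int total_iters, int iters_per_graph)
{
    cudaStream_t s;
    CHECK(cudaStreamCreate(&s));

    // All allocation happens before capture begins.
    int *d_counter = nullptr;
    CHECK(cudaMalloc(&d_counter, sizeof(int)));
    CHECK(cudaMemset(d_counter, 0, sizeof(int)));

    // Capture one chunk of iterations into a graph.
    cudaGraph_t graph;
    cudaGraphExec_t instance;
    CHECK(cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal));
    for (int i = 0; i < iters_per_graph; i++)
        step<<<1, 1, 0, s>>>(d_counter);
    CHECK(cudaStreamEndCapture(s, &graph));
    CHECK(cudaGraphInstantiateWithFlags(&instance, graph, 0));

    // Relaunch the same executable graph; launches on one stream run in order.
    for (int k = 0; k < total_iters / iters_per_graph; k++)
        CHECK(cudaGraphLaunch(instance, s));

    // Leftover iterations that do not fill a whole chunk run without a graph.
    for (int i = 0; i < total_iters % iters_per_graph; i++)
        step<<<1, 1, 0, s>>>(d_counter);
    CHECK(cudaGetLastError());
    CHECK(cudaStreamSynchronize(s));

    int h_counter = 0;
    CHECK(cudaMemcpy(&h_counter, d_counter, sizeof(int),
                     cudaMemcpyDeviceToHost));

    CHECK(cudaGraphExecDestroy(instance));
    CHECK(cudaGraphDestroy(graph));
    CHECK(cudaFree(d_counter));
    CHECK(cudaStreamDestroy(s));
    return h_counter;
}
```

Calling run_chunked(50000, 1000) would capture 1000 iterations and launch that graph 50 times. Whether 1000 is a good chunk size for your case is something you would have to measure; I only know that your capture failed somewhere around 30,000 iterations.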


Thank you. I think it could be the size limit of a fixed-size structure that I am hitting. I monitored “top” as well, and saw that the CPU memory usage was not that big either. I will try the approach of chunking it into smaller parts that can each be captured by a graph.