I am trying to use CUDA Graphs in my application, where I compute a matrix-vector product with a matrix-free method for many vectors. To test this, I tried the following code:
bool graphCreated = false;
cudaGraph_t graph;
cudaGraphExec_t instance;

if (!graphCreated)
{
    // Capture all the work my_function enqueues on stream s into a graph.
    gpuErrchk(cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal));
    for (int i = 0; i < 50000; i++)
        my_function(inputs);
    gpuErrchk(cudaStreamEndCapture(s, &graph));
    gpuErrchk(cudaGraphInstantiate(&instance, graph, NULL, NULL, 0));
    graphCreated = true;
}
// Launch the captured work once and wait for it to finish.
gpuErrchk(cudaGraphLaunch(instance, s));
gpuErrchk(cudaStreamSynchronize(s));
gpuErrchk(cudaGraphExecDestroy(instance));
gpuErrchk(cudaGraphDestroy(graph));
I had my_function print a message every 100 iterations to keep track of how many times it was called. I noticed that the time per iteration grew as the iteration count increased, which was strange since, for this test, the “inputs” parameters are identical on every call. Then, at around iteration 30,000, I got the error “allocation of new ptrcache entry failed” and the program quit.
my_function contains some explicit CUDA kernels, cuBLAS calls, and NCCL communication calls. I removed all of the cudaStreamSynchronize() calls from inside my_function.
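To give a clearer picture, here is a rough placeholder sketch of what my_function looks like (the kernel name, the Inputs fields, and the particular cuBLAS/NCCL routines are made up for illustration; the real code differs). The stream s, the cuBLAS handle, and the NCCL communicator are all created once before the capture, and everything is enqueued on s:

// Placeholder sketch only; names and sizes do not match my real code.
// Assumes s, cublas_handle, and nccl_comm were created at startup.
void my_function(const Inputs& in)
{
    static int calls = 0;
    if (++calls % 100 == 0)
        printf("my_function called %d times\n", calls);   // host-side progress print

    // Matrix-free operator application in a hand-written kernel, enqueued on s.
    apply_operator<<<in.blocks, in.threads, 0, s>>>(in.d_x, in.d_y, in.n);

    // Example cuBLAS call on the same stream.
    cublasSetStream(cublas_handle, s);
    cublasDaxpy(cublas_handle, in.n, &in.alpha, in.d_x, 1, in.d_y, 1);

    // Example NCCL call on the same stream; no cudaStreamSynchronize anywhere.
    ncclAllReduce(in.d_y, in.d_y, in.n, ncclDouble, ncclSum, nccl_comm, s);
}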
To see whether it was a memory issue, I ran the code again with much smaller input vectors, but the same thing happened. Watching nvidia-smi, I noticed a slight increase in memory usage as the iterations went on, but nothing close to the GPU's 40 GB limit. my_function does not allocate or free any memory; all of that is done before/after this piece of code.
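In case it is useful, this is roughly how I could measure the growth from inside the program instead of eyeballing nvidia-smi (just a sketch, not my actual code, with a smaller iteration count so that the capture actually completes; cudaMemGetInfo is only called outside the capture region):

size_t free_before, free_after_capture, free_after_inst, total;
gpuErrchk(cudaMemGetInfo(&free_before, &total));

gpuErrchk(cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal));
for (int i = 0; i < 10000; i++)          // smaller count so capture finishes
    my_function(inputs);
gpuErrchk(cudaStreamEndCapture(s, &graph));
gpuErrchk(cudaMemGetInfo(&free_after_capture, &total));

gpuErrchk(cudaGraphInstantiate(&instance, graph, NULL, NULL, 0));
gpuErrchk(cudaMemGetInfo(&free_after_inst, &total));

printf("capture used %zu MB, instantiation used %zu MB\n",
       (free_before - free_after_capture) >> 20,
       (free_after_capture - free_after_inst) >> 20);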