How to use cuda graph with cudaMallocAsync?

Hi guys.

I am looking into the CUDA graph feature. CUDA graphs have also been integrated into PyTorch. A captured graph acts on the same virtual addresses every time it replays. To achieve this, PyTorch implements a private memory pool in which the virtual addresses used by the graph are reserved for the graph across replays. But it seems a reserved memory pool may no longer be a requirement with the newer cudaMallocAsync feature, since memory allocations made through it are safe to capture in any region. I am wondering how to integrate the two so as to avoid the need for a private memory pool. Any high-level idea or example is appreciated.


I think my question boils down to this: if we have cudaMallocAsync, do we still need fixed input and output memory addresses for a captured graph?
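For context, here is a minimal, untested sketch of what I have in mind (assumes CUDA 11.4+, where cudaMallocAsync/cudaFreeAsync issued during stream capture become allocation/free nodes inside the graph itself; the kernel and sizes are made up for illustration):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Begin stream capture; work issued on `stream` is recorded into a
    // graph instead of being executed immediately.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

    // With CUDA 11.4+, these calls are captured as memory nodes of the
    // graph, so no externally reserved pool is needed for this buffer.
    float *scratch = nullptr;
    cudaMallocAsync(&scratch, n * sizeof(float), stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(scratch, n);
    cudaFreeAsync(scratch, stream);

    cudaGraph_t graph;
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // Each replay re-runs alloc -> kernel -> free; the graph hands the
    // kernel the same virtual address on every launch.
    for (int i = 0; i < 3; ++i)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    return 0;
}
```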

@Robert_Crovella Hi Robert, do you have any basic idea?

Sorry, I don’t know much about pytorch internals.

If your graph would typically use a space that had been allocated with cudaMalloc, I’m not sure there is any difference in graph setup if the space happens to be allocated with cudaMallocAsync.
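To illustrate that point, a hedged sketch (assumed, not taken from PyTorch): a buffer allocated with cudaMallocAsync *before* capture begins is used inside the captured graph exactly as a cudaMalloc'd buffer would be; the graph simply records kernels acting on that fixed address.

```cuda
#include <cuda_runtime.h>

__global__ void add_one(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main() {
    const int n = 1024;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Allocate outside capture -- cudaMalloc would work identically here.
    float *buf = nullptr;
    cudaMallocAsync(&buf, n * sizeof(float), stream);
    cudaStreamSynchronize(stream);

    // The captured graph references `buf`'s fixed address directly.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    add_one<<<(n + 255) / 256, 256, 0, stream>>>(buf, n);
    cudaGraph_t graph;
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    cudaGraphLaunch(exec, stream);  // replays against the same address
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaFreeAsync(buf, stream);
    cudaStreamDestroy(stream);
    return 0;
}
```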