How to use CUDA graphs with cudaMallocAsync?

Hi guys.

I am looking into the CUDA graph feature, which has also been integrated into PyTorch. A captured graph acts on the same virtual addresses every time it replays. To achieve this, PyTorch implements a private memory pool in which the virtual addresses used by the graph are reserved for the graph across replays. But it seems the reserved memory pool is not a requirement with the new cudaMallocAsync feature, since memory allocations in any region will be safe to capture. I am wondering how to integrate these two to avoid the need for a private memory pool. Any high-level idea or example is appreciated.


I think my question is: if we have cudaMallocAsync, do we still need fixed input and output memory addresses for a captured graph?

@Robert_Crovella Hi Robert, do you have any ideas?

Sorry, I don’t know much about PyTorch internals.

If your graph would typically use a space that had been allocated with cudaMalloc, I’m not sure there is any difference in graph setup if the space happens to be allocated with cudaMallocAsync.
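As a minimal sketch of that point: the buffer below is allocated with cudaMallocAsync before capture begins and is then used by a captured kernel exactly as a cudaMalloc buffer would be. The kernel `add_one`, the helper `replay_with_async_buffer` and the `CHECK` macro are made-up names for illustration, and the sketch assumes CUDA 11.4 or newer (for cudaGraphInstantiateWithFlags). It is not how PyTorch does it.

```cuda
#include <cuda_runtime.h>

// Illustrative error handling only: bail out with -1 (resources leak on failure).
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical kernel: increments every element of the buffer.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Returns data[0] after `replays` graph launches, or -1 on a CUDA error.
int replay_with_async_buffer(int replays) {
    const int n = 256;
    cudaStream_t stream;
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    int *buf = nullptr;
    int result = -1;

    CHECK(cudaStreamCreate(&stream));

    // Allocated OUTSIDE the capture: this is an ordinary stream-ordered
    // allocation, not a graph memory node.
    CHECK(cudaMallocAsync((void **)&buf, n * sizeof(int), stream));
    CHECK(cudaMemsetAsync(buf, 0, n * sizeof(int), stream));

    // The captured kernel node records the virtual address held in buf.
    CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    add_one<<<1, n, 0, stream>>>(buf, n);
    CHECK(cudaStreamEndCapture(stream, &graph));
    CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));

    // Every replay acts on the same address, so buf must stay allocated.
    for (int i = 0; i < replays; ++i) CHECK(cudaGraphLaunch(exec, stream));

    CHECK(cudaMemcpyAsync(&result, buf, sizeof(int), cudaMemcpyDeviceToHost, stream));
    CHECK(cudaFreeAsync(buf, stream));  // only after the last replay
    CHECK(cudaStreamSynchronize(stream));

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    return result;
}
```

Calling `replay_with_async_buffer(3)` from a host `main` should leave each element incremented three times. The buffer has to outlive every replay, which is the same constraint a cudaMalloc buffer would have.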

You still need fixed input / output virtual addresses for the captured graph.
If you want to change the addresses of input / output buffers, you will have to update the graph such that it is aware of these new addresses.
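One way to do such an update, sketched under some assumptions: the graph is built explicitly so the kernel node handle is at hand, and the instantiated graph is then pointed at a different output buffer with cudaGraphExecKernelNodeSetParams. The kernel `fill`, the helper `retarget_graph_output` and the `CHECK` macro are hypothetical names, and CUDA 11.4 or newer is assumed. With a stream-captured graph you would first have to locate the node (for example via cudaGraphGetNodes), or re-capture and use cudaGraphExecUpdate instead.

```cuda
#include <cuda_runtime.h>

// Illustrative error handling only: bail out with -1 (resources leak on failure).
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical kernel: writes `value` into every element of `out`.
__global__ void fill(int *out, int value) { out[threadIdx.x] = value; }

// Returns 1 if the first launch wrote only to `a` and the launch after the
// update wrote to `b`, 0 if not, and -1 on a CUDA error.
int retarget_graph_output() {
    const int n = 32;
    int *a = nullptr, *b = nullptr;
    CHECK(cudaMalloc((void **)&a, n * sizeof(int)));
    CHECK(cudaMalloc((void **)&b, n * sizeof(int)));
    CHECK(cudaMemset(a, 0, n * sizeof(int)));
    CHECK(cudaMemset(b, 0, n * sizeof(int)));

    int value = 7;
    void *args[] = { &a, &value };  // kernel arguments; out = a initially
    cudaKernelNodeParams params = {};
    params.func = (void *)fill;
    params.gridDim = dim3(1);
    params.blockDim = dim3(n);
    params.sharedMemBytes = 0;
    params.kernelParams = args;
    params.extra = nullptr;

    cudaStream_t stream;
    cudaGraph_t graph;
    cudaGraphNode_t node;
    cudaGraphExec_t exec;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaGraphCreate(&graph, 0));
    CHECK(cudaGraphAddKernelNode(&node, graph, nullptr, 0, &params));
    CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));

    // First replay uses the address baked in at construction time (a).
    CHECK(cudaGraphLaunch(exec, stream));
    CHECK(cudaStreamSynchronize(stream));
    int a0 = -1, b0 = -1;
    CHECK(cudaMemcpy(&a0, a, sizeof(int), cudaMemcpyDeviceToHost));
    CHECK(cudaMemcpy(&b0, b, sizeof(int), cudaMemcpyDeviceToHost));
    bool first_ok = (a0 == 7 && b0 == 0);

    // Make the executable graph aware of the new output address (b).
    args[0] = &b;
    CHECK(cudaGraphExecKernelNodeSetParams(exec, node, &params));
    CHECK(cudaGraphLaunch(exec, stream));
    CHECK(cudaStreamSynchronize(stream));
    CHECK(cudaMemcpy(&b0, b, sizeof(int), cudaMemcpyDeviceToHost));

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(a);
    cudaFree(b);
    return (first_ok && b0 == 7) ? 1 : 0;
}
```

Without the cudaGraphExecKernelNodeSetParams call, the second launch would simply write to `a` again.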

Similarly, captured cudaMallocAsync operations (ones issued between begin capture / end capture, and therefore translated into graph memory nodes) get fixed virtual addresses. While the captured asynchronous allocation keeps its virtual address reserved, the physical memory backing it can be released while the graph isn’t running.
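A rough sketch of what a captured allocation looks like, assuming CUDA 11.4 or newer (graph memory nodes). The cudaMallocAsync / cudaFreeAsync pair sits inside the capture, so it becomes a memory allocation node and a memory free node, and the captured kernels keep using the address returned at capture time. The kernels `write_squares` and `add_one_to`, the helper `capture_async_alloc` and the `CHECK` macro are made-up names for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Illustrative error handling only: bail out with -1 (resources leak on failure).
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical kernels: tmp[i] = i * i, then out[i] = tmp[i] + 1.
__global__ void write_squares(int *tmp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = i * i;
}
__global__ void add_one_to(const int *tmp, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[i] + 1;
}

// Returns out[3] if the captured graph holds exactly one memory allocation
// node, and -1 otherwise or on a CUDA error.
int capture_async_alloc() {
    const int n = 64;
    int *out = nullptr;  // graph output, allocated outside the capture
    CHECK(cudaMalloc((void **)&out, n * sizeof(int)));
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    cudaGraph_t graph;
    cudaGraphExec_t exec;
    int *tmp = nullptr;
    CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    // Captured: becomes a memory allocation node. The address stored in tmp
    // is the one every replay will use.
    CHECK(cudaMallocAsync((void **)&tmp, n * sizeof(int), stream));
    write_squares<<<1, n, 0, stream>>>(tmp, n);
    add_one_to<<<1, n, 0, stream>>>(tmp, out, n);
    // Captured: becomes a memory free node, so the graph can be relaunched.
    CHECK(cudaFreeAsync(tmp, stream));
    CHECK(cudaStreamEndCapture(stream, &graph));

    // Count the memory allocation nodes the capture produced.
    size_t num_nodes = 0;
    CHECK(cudaGraphGetNodes(graph, nullptr, &num_nodes));
    std::vector<cudaGraphNode_t> nodes(num_nodes);
    CHECK(cudaGraphGetNodes(graph, nodes.data(), &num_nodes));
    int alloc_nodes = 0;
    for (cudaGraphNode_t node : nodes) {
        cudaGraphNodeType type;
        CHECK(cudaGraphNodeGetType(node, &type));
        if (type == cudaGraphNodeTypeMemAlloc) ++alloc_nodes;
    }

    CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));
    for (int i = 0; i < 3; ++i) CHECK(cudaGraphLaunch(exec, stream));
    CHECK(cudaStreamSynchronize(stream));

    // Optional: ask the driver to return cached, unused graph memory to the OS.
    int device = 0;
    CHECK(cudaGetDevice(&device));
    CHECK(cudaDeviceGraphMemTrim(device));

    int result = -1;
    CHECK(cudaMemcpy(&result, out + 3, sizeof(int), cudaMemcpyDeviceToHost));

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(out);
    return alloc_nodes == 1 ? result : -1;
}
```

Note that `out` is still an ordinary fixed-address buffer: the graph-owned allocation removes the need to manage the temporary yourself, but it does not make the input / output addresses dynamic.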

Graph memory nodes do not automatically update the work inside the graph to refer to different virtual addresses than the ones used during graph construction.