How to use cuda graph with cudaMallocAsync?

Hi guys.

I am looking into the CUDA graph feature. CUDA graphs have also been integrated into PyTorch. A captured graph acts on the same virtual addresses every time it replays. To achieve this, PyTorch implements a private memory pool in which the virtual addresses used by the graph are reserved for the graph across replays. But it seems a reserved memory pool may no longer be a requirement with the newer cudaMallocAsync feature, since memory allocations made through it are safe to capture in any region. I am wondering how to integrate the two so as to avoid the need for a private memory pool. Any high-level idea or example is appreciated.


I think my question boils down to this: if we have cudaMallocAsync, do we still need fixed input and output memory addresses for a captured graph?
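For context, here is a minimal, untested sketch of what I have in mind (assumes CUDA 11.4+, where cudaMallocAsync/cudaFreeAsync issued during stream capture become allocation/free nodes inside the graph itself; the kernel and sizes are made up for illustration):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Begin stream capture; work issued on `stream` is recorded into a
    // graph instead of being executed immediately.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

    // With CUDA 11.4+, these calls are captured as memory nodes of the
    // graph, so no externally reserved pool is needed for this buffer.
    float *scratch = nullptr;
    cudaMallocAsync(&scratch, n * sizeof(float), stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(scratch, n);
    cudaFreeAsync(scratch, stream);

    cudaGraph_t graph;
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // Each replay re-runs alloc -> kernel -> free; the graph hands the
    // kernel the same virtual address on every launch.
    for (int i = 0; i < 3; ++i)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    return 0;
}
```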

@Robert_Crovella Hi Robert, do you have any basic idea?

Sorry, I don’t know much about pytorch internals.

If your graph would typically use a space that had been allocated with cudaMalloc, I’m not sure there is any difference in graph setup if the space happens to be allocated with cudaMallocAsync.
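To illustrate that point, a hedged sketch (assumed, not taken from PyTorch): a buffer allocated with cudaMallocAsync *before* capture begins is used inside the captured graph exactly as a cudaMalloc'd buffer would be; the graph simply records kernels acting on that fixed address.

```cuda
#include <cuda_runtime.h>

__global__ void add_one(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main() {
    const int n = 1024;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Allocate outside capture -- cudaMalloc would work identically here.
    float *buf = nullptr;
    cudaMallocAsync(&buf, n * sizeof(float), stream);
    cudaStreamSynchronize(stream);

    // The captured graph references `buf`'s fixed address directly.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    add_one<<<(n + 255) / 256, 256, 0, stream>>>(buf, n);
    cudaGraph_t graph;
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    cudaGraphLaunch(exec, stream);  // replays against the same address
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaFreeAsync(buf, stream);
    cudaStreamDestroy(stream);
    return 0;
}
```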