How to use CUDA graphs with cudaMallocAsync?

Hi guys.

I am looking into the CUDA graphs feature. CUDA graphs have also been integrated into PyTorch. A captured graph acts on the same virtual addresses every time it replays. To achieve this, PyTorch implements a private memory pool in which the virtual addresses used by the graph are reserved for the graph across replays. But it seems a reserved memory pool is no longer a requirement with the new cudaMallocAsync feature, since memory allocations in any region should be safe to capture. I am wondering how to integrate the two so as to avoid the need for a private memory pool. Any high-level idea or example is appreciated.

Thanks

I think my question is: if we have cudaMallocAsync, do we still need fixed input and output memory addresses for a captured graph?

@Robert_Crovella Hi Robert, do you have any thoughts on this?

Sorry, I don’t know much about PyTorch internals.

If your graph would typically use a space that had been allocated with cudaMalloc, I’m not sure there is any difference in graph setup if the space happens to be allocated with cudaMallocAsync.
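For illustration, here is a minimal sketch of that point (the `scale` kernel, sizes, and names are purely illustrative, and I’m using the CUDA 11-style cudaGraphInstantiate signature): the buffer is allocated with cudaMallocAsync before capture begins, and the captured graph then replays against that same fixed address, exactly as it would with a cudaMalloc’ed buffer.

```cpp
#include <cuda_runtime.h>

// Illustrative kernel; any captured work would behave the same way.
__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float* d_buf = nullptr;
    cudaMallocAsync(&d_buf, n * sizeof(float), stream);   // outside capture: an ordinary allocation

    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n); // kernel recorded with d_buf baked in
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0); // CUDA 11-style signature

    for (int i = 0; i < 10; ++i)
        cudaGraphLaunch(exec, stream);                    // every replay uses the same address

    cudaStreamSynchronize(stream);
    cudaFreeAsync(d_buf, stream);
    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    return 0;
}
```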

You still need fixed input / output virtual addresses for the captured graph.
If you want to change the addresses of input / output buffers, you will have to update the graph such that it is aware of these new addresses.
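In code, updating an instantiated graph might look like this. This is a hedged sketch continuing from the snippet above (it assumes that graph contains a single kernel node launching the illustrative `scale` kernel; it is not how PyTorch does this internally): cudaGraphExecKernelNodeSetParams patches the kernel’s parameter list in the instantiated graph so subsequent replays use a new buffer.

```cpp
// Find the lone kernel node that the capture produced.
cudaGraphNode_t node;
size_t numNodes = 1;
cudaGraphGetNodes(graph, &node, &numNodes);

// Read back its current launch parameters.
cudaKernelNodeParams params;
cudaGraphKernelNodeGetParams(node, &params);

// Allocate a new buffer and point the kernel's arguments at it.
float* d_newBuf = nullptr;
cudaMallocAsync(&d_newBuf, n * sizeof(float), stream);
int count = n;
void* args[] = { &d_newBuf, &count };    // same kernel, new pointer argument
params.kernelParams = args;

// Patch the instantiated graph in place; no re-capture or re-instantiation.
cudaGraphExecKernelNodeSetParams(exec, node, &params);

cudaGraphLaunch(exec, stream);           // replays now read/write d_newBuf
```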

Similarly, captured cudaMallocAsync operations (ones that are translated to graph memory nodes by falling between begin capture / end capture) get fixed virtual addresses. While a captured asynchronous allocation keeps its virtual address reserved, the physical memory backing it can be released while the graph isn’t running.
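As a sketch (CUDA 11.4 or newer, again reusing the illustrative `scale` kernel): a cudaMallocAsync that falls between begin/end capture is recorded as a memory-allocation node, and the address it returns at capture time is the one the graph reserves across replays. cudaDeviceGraphMemTrim can then hand unused physical pages back while the graph is idle.

```cpp
cudaStream_t stream;
cudaStreamCreate(&stream);

cudaGraph_t graph;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

const int n = 1 << 20;
float* d_tmp = nullptr;
cudaMallocAsync(&d_tmp, n * sizeof(float), stream);   // recorded as a mem-alloc node; address fixed now
scale<<<(n + 255) / 256, 256, 0, stream>>>(d_tmp, n); // work is baked against that address
cudaFreeAsync(d_tmp, stream);                         // recorded as a mem-free node

cudaStreamEndCapture(stream, &graph);

cudaGraphExec_t exec;
cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0); // CUDA 11-style signature
cudaGraphLaunch(exec, stream);
cudaStreamSynchronize(stream);

cudaDeviceGraphMemTrim(0);   // optionally release unused physical graph memory on device 0
```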

Graph memory nodes do not automatically update the work inside the graph to refer to different virtual addresses than the ones used during graph construction.