I am currently experimenting with some optimizations of the memory usage of CUDA applications. I want to programmatically move data between the host (CPU) and the device (GPU). cudaMemPrefetchAsync does allow me to move data using Unified Memory when it is called directly.
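Here is a minimal sketch of that working case (the buffer size and the device ordinal 0 are just placeholders):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t N = 1 << 20;
    float *data = nullptr;
    cudaMallocManaged(&data, N * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Migrate the pages to device 0 ahead of the kernels that will use them...
    cudaMemPrefetchAsync(data, N * sizeof(float), 0, stream);

    // ...and back to the host when the CPU needs them again.
    cudaMemPrefetchAsync(data, N * sizeof(float), cudaCpuDeviceId, stream);

    cudaStreamSynchronize(stream);
    printf("prefetch status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(stream);
    cudaFree(data);
    return 0;
}
```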
However, it cannot be called in a CUDA graph (a minimal repro follows this list):
- When calling cudaMemPrefetchAsync while the stream is being captured, it returns error code 900, which is cudaErrorStreamCaptureUnsupported.
- It also cannot be explicitly added to the graph, because there is no node type corresponding to this operation, and if I call cudaMemPrefetchAsync from a host node in the graph, it reports another error, code 800, cudaErrorNotPermitted. I believe this is because of the limitation that host nodes in a CUDA graph cannot call CUDA APIs.
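For reference, here is a minimal repro of the first failure mode (the allocation size is arbitrary):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 1 << 20;
    char *data = nullptr;
    cudaMallocManaged(&data, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

    // This is the call that fails under stream capture.
    cudaError_t err = cudaMemPrefetchAsync(data, bytes, 0, stream);
    printf("during capture: %d (%s)\n", (int)err, cudaGetErrorName(err));
    // prints: during capture: 900 (cudaErrorStreamCaptureUnsupported)

    // The failed call also invalidates the capture, so ending it reports
    // an error as well and no graph is produced.
    cudaGraph_t graph = nullptr;
    cudaError_t endErr = cudaStreamEndCapture(stream, &graph);
    printf("end capture: %d (%s)\n", (int)endErr, cudaGetErrorName(endErr));

    cudaStreamDestroy(stream);
    cudaFree(data);
    return 0;
}
```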
Without cudaMemPrefetchAsync, I can still move the data from device to host by explicitly accessing all the memory on the host, and vice versa, though as you can imagine this has a huge overhead.
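Roughly, that fallback looks like the following sketch (the 4 KiB stride is my assumption about the Unified Memory page granularity; it may be larger on some systems):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Touch one byte per page on the host so demand paging pulls each page
// back from the device. The 4 KiB stride is an assumed page size.
void touch_on_host(const char *data, size_t bytes) {
    const size_t page = 4096;
    volatile char sink = 0;
    for (size_t i = 0; i < bytes; i += page)
        sink = data[i];
    (void)sink;
}

// The reverse direction: one thread per page faults it onto the device.
// Launch as e.g. touch_on_device<<<(bytes / 4096 + 255) / 256, 256>>>(data, bytes);
__global__ void touch_on_device(const char *data, size_t bytes) {
    const size_t page = 4096;
    size_t i = ((size_t)blockIdx.x * blockDim.x + threadIdx.x) * page;
    if (i < bytes) {
        volatile char sink = data[i];
        (void)sink;
    }
}
```

The overhead comes from each touched page being migrated by its own demand-paging fault, whereas cudaMemPrefetchAsync migrates the whole range in one bulk operation.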
So I wonder whether there is a way to call cudaMemPrefetchAsync, or otherwise quickly move data with Unified Memory, inside a CUDA graph.
Many thanks in advance!