I am currently experimenting with optimizations to the memory usage of CUDA applications.
I want to programmatically move data between the host (CPU) and the device (GPU).
cudaMemPrefetchAsync allows me to move data using Unified Memory when called directly.
However, it cannot be called in a CUDA Graph:
- When calling cudaMemPrefetchAsync while the stream is being captured, it returns error code 900, which is cudaErrorStreamCaptureUnsupported.
- It also cannot be explicitly added to the graph, because there is no node type corresponding to this operation. If I instead call cudaMemPrefetchAsync from a host node in the graph, it reports another error, code 800, cudaErrorNotPermitted. I believe this is due to the limitation that host nodes in a CUDA graph cannot call CUDA APIs.
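For reference, here is a minimal sketch of the first failure mode (untested here; buffer size and error handling are illustrative, not from a real run):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    float *data;
    size_t bytes = 1 << 20;
    cudaMallocManaged(&data, bytes);  // Unified Memory allocation

    int device = 0;
    cudaGetDevice(&device);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Begin capturing the stream into a graph.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

    // Prefetch is not a capturable operation: this is where I get
    // error 900 (cudaErrorStreamCaptureUnsupported), and the capture
    // sequence is invalidated.
    cudaError_t err = cudaMemPrefetchAsync(data, bytes, device, stream);
    printf("prefetch during capture: %s\n", cudaGetErrorName(err));

    // End the (now-invalid) capture and clean up.
    cudaGraph_t graph;
    cudaStreamEndCapture(stream, &graph);

    cudaStreamDestroy(stream);
    cudaFree(data);
    return 0;
}
```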
Without calling cudaMemPrefetchAsync, I can move the data from device to host by explicitly accessing all the memory on the host, and vice versa, though as you can imagine this has a huge overhead.
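To be concrete, by "explicitly accessing all the memory" I mean something like the following sketch: touching one element per page triggers a page fault that migrates that page. The page size and helper names here are my own assumptions for illustration; the per-page fault handling is exactly the overhead I am referring to.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Device -> host migration without prefetch: touch every page on the CPU.
// Each first-touch read faults that page back into host memory.
void touch_on_host(volatile char *p, size_t bytes, size_t page = 4096) {
    for (size_t i = 0; i < bytes; i += page)
        (void)p[i];
}

// Host -> device migration: a kernel node can touch one byte per page,
// faulting the pages over to the GPU. This *can* live inside a CUDA graph,
// unlike cudaMemPrefetchAsync.
__global__ void touch_on_device(volatile char *p, size_t bytes, size_t page) {
    size_t i = (blockIdx.x * (size_t)blockDim.x + threadIdx.x) * page;
    if (i < bytes)
        (void)p[i];
}
```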
So I wonder whether there is a way to call cudaMemPrefetchAsync, or to otherwise quickly migrate Unified Memory data, inside a CUDA graph.
Many thanks in advance!