Is there a way to call cudaMemPrefetchAsync in a CUDA Graph?

I am currently experimenting with some optimizations to the memory usage of CUDA applications.

I want to programmatically move data between the host (CPU) and the device (GPU). cudaMemPrefetchAsync does allow me to move data allocated with Unified Memory, when called directly.
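For context, here is a minimal sketch of the direct call that works for me. The allocation size and device ordinal are just illustrative:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = (1 << 20) * sizeof(float);
    float *data = nullptr;
    cudaMallocManaged(&data, bytes);  // Unified Memory allocation

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Prefetch to device 0: works fine when called directly on a stream.
    cudaError_t err = cudaMemPrefetchAsync(data, bytes, 0, stream);
    printf("prefetch to device: %s\n", cudaGetErrorString(err));

    // Prefetch back to the host.
    err = cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, stream);
    printf("prefetch to host: %s\n", cudaGetErrorString(err));

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(data);
    return 0;
}
```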

However, it cannot be called in a CUDA Graph:

  • When cudaMemPrefetchAsync is called while the stream is being captured, it returns error code 900, which is cudaErrorStreamCaptureUnsupported.
  • It also cannot be added to the graph explicitly, because there is no node type corresponding to this operation, and if I call cudaMemPrefetchAsync from a host node in the graph, it fails with another error code, 800 (cudaErrorNotPermitted). I believe this is due to the limitation that host nodes in a CUDA graph cannot call CUDA APIs.
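The first failure mode is easy to reproduce with a capture sketch like the one below (the exact error codes may vary across CUDA versions; in my setup, the failed call during capture also invalidates the capture, so cudaStreamEndCapture reports an error as well):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = (1 << 20) * sizeof(float);
    float *data = nullptr;
    cudaMallocManaged(&data, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

    // This is the call that fails during capture: it returns
    // cudaErrorStreamCaptureUnsupported (900) instead of being recorded.
    cudaError_t err = cudaMemPrefetchAsync(data, bytes, 0, stream);
    printf("prefetch during capture: %d (%s)\n", err, cudaGetErrorString(err));

    // The capture is now invalidated, so ending it also reports an error.
    cudaGraph_t graph = nullptr;
    err = cudaStreamEndCapture(stream, &graph);
    printf("end capture: %d (%s)\n", err, cudaGetErrorString(err));

    cudaStreamDestroy(stream);
    cudaFree(data);
    return 0;
}
```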

Without cudaMemPrefetchAsync, I can move the data from device to host by explicitly accessing all of the memory on the host, and vice versa, though as you can imagine this has a huge overhead.
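On the device side, the workaround I mean is a dummy kernel that performs one faulting read per page, which can be captured into a graph like any other kernel launch. This is only a sketch; the 4 KiB page size and the grid sizing are assumptions:

```cuda
// Touch one byte per page so on-demand migration pulls the managed
// allocation to the device. The volatile read keeps the access from
// being optimized away.
__global__ void touch_pages(const volatile char *p, size_t bytes) {
    const size_t page = 4096;  // assumed page size
    size_t i = (blockIdx.x * (size_t)blockDim.x + threadIdx.x) * page;
    if (i < bytes) {
        volatile char c = p[i];
        (void)c;
    }
}

// Launch inside capture, e.g.:
//   size_t pages = (bytes + 4095) / 4096;
//   touch_pages<<<(pages + 255) / 256, 256, 0, stream>>>(
//       reinterpret_cast<const char *>(data), bytes);
```

This migrates pages via demand faults rather than a bulk prefetch, so it is much slower than cudaMemPrefetchAsync, which is exactly the overhead I want to avoid.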

So I wonder whether there is a way to call cudaMemPrefetchAsync, or otherwise quickly move data with Unified Memory, inside a CUDA graph.

Many thanks in advance!