I have a manually constructed CUDA graph application with memcpy and kernel nodes. After calling cudaGraphInstantiate(), I record a start event before cudaGraphLaunch() and an end event after it, then use cudaEventElapsedTime() to get the execution time of the whole application, i.e. H2D + kernels + D2H. I have two questions. (1) Is this the right way to measure time with CUDA graphs? (2) How can I measure the execution time excluding the H2D and D2H copies, so that I get only the interval between the start of the first kernel and the end of the last kernel? I cannot simply drop the memcpy nodes, because the application is not complete without them.
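
For reference, here is a minimal sketch of the timing setup described above. It is only an illustration, not the real application: the `scale` kernel, the buffer size, the `CHECK` macro and the `timeGraphLaunch` helper are placeholder names, and it assumes CUDA 11.4 or newer so that cudaGraphInstantiateWithFlags() can stand in for the instantiate step. The measured interval covers the whole graph (H2D + kernel + D2H).

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Placeholder kernel standing in for the real kernels of the application.
__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Builds H2D -> kernel -> D2H as explicit graph nodes, launches the graph
// once, and returns the elapsed time in milliseconds for the whole graph.
float timeGraphLaunch() {
    int n = 1 << 20;                       // assumed problem size
    size_t bytes = n * sizeof(float);
    std::vector<float> host(n, 1.0f);
    float* dev = nullptr;
    CHECK(cudaMalloc(&dev, bytes));

    cudaGraph_t graph;
    CHECK(cudaGraphCreate(&graph, 0));

    // H2D memcpy node (no dependencies).
    cudaGraphNode_t h2d, kernel, d2h;
    CHECK(cudaGraphAddMemcpyNode1D(&h2d, graph, nullptr, 0, dev, host.data(),
                                   bytes, cudaMemcpyHostToDevice));

    // Kernel node, depends on the H2D copy.
    void* args[] = { &dev, &n };
    cudaKernelNodeParams kp = {};
    kp.func = (void*)scale;
    kp.gridDim = dim3((n + 255) / 256);
    kp.blockDim = dim3(256);
    kp.sharedMemBytes = 0;
    kp.kernelParams = args;
    kp.extra = nullptr;
    CHECK(cudaGraphAddKernelNode(&kernel, graph, &h2d, 1, &kp));

    // D2H memcpy node, depends on the kernel.
    CHECK(cudaGraphAddMemcpyNode1D(&d2h, graph, &kernel, 1, host.data(), dev,
                                   bytes, cudaMemcpyDeviceToHost));

    cudaGraphExec_t exec;
    CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));

    cudaStream_t stream;
    cudaEvent_t start, stop;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Events are recorded in the same stream, around the graph launch.
    CHECK(cudaEventRecord(start, stream));
    CHECK(cudaGraphLaunch(exec, stream));
    CHECK(cudaEventRecord(stop, stream));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaGraphExecDestroy(exec));
    CHECK(cudaGraphDestroy(graph));
    CHECK(cudaFree(dev));
    return ms;
}

int main() {
    std::printf("graph launch (H2D + kernel + D2H): %.3f ms\n", timeGraphLaunch());
    return 0;
}
```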
Thanks a lot.