CUDA graph timing measurement


I have a manually created CUDA graph application with memory copy and kernel nodes. After calling cudaGraphInstantiate(), I record a start event before cudaGraphLaunch() and an end event after it. cudaEventElapsedTime() then gives me the execution time of the whole application, namely H2D + kernels + D2H. I have two questions. (1) Is this the right way to measure time with CUDA graphs? (2) How can I measure the execution time excluding H2D and D2H? I want only the time between the start of the first kernel and the end of the last kernel. If I don't add the memcpy nodes, the application is not complete.

Thanks a lot.
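For reference, the whole-graph measurement described above can be sketched roughly as follows. This is a minimal sketch, not the poster's actual code: the kernel `scale`, the buffer size, and the single warm-up launch are assumptions made for illustration, `cudaGraphAddMemcpyNode1D` and `cudaGraphInstantiateWithFlags` assume a reasonably recent CUDA 11.x or later toolkit, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the application's kernels.
__global__ void scale(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h = nullptr, *d = nullptr;
    cudaMallocHost(&h, bytes);  // pinned host buffer
    cudaMalloc(&d, bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    // Build the graph manually: H2D -> kernel -> D2H.
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);
    cudaGraphNode_t h2d, kernel, d2h;
    cudaGraphAddMemcpyNode1D(&h2d, graph, nullptr, 0, d, h, bytes,
                             cudaMemcpyHostToDevice);
    void* args[] = {&d, &n};
    cudaKernelNodeParams kp = {};
    kp.func = (void*)scale;
    kp.gridDim = dim3((n + 255) / 256);
    kp.blockDim = dim3(256);
    kp.kernelParams = args;
    cudaGraphAddKernelNode(&kernel, graph, &h2d, 1, &kp);
    cudaGraphAddMemcpyNode1D(&d2h, graph, &kernel, 1, h, d, bytes,
                             cudaMemcpyDeviceToHost);

    cudaGraphExec_t exec;
    cudaGraphInstantiateWithFlags(&exec, graph, 0);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Warm-up launch (an assumption: the first launch may include one-time setup).
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    // Events recorded in the same stream bracket the whole graph launch.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, stream);
    cudaGraphLaunch(exec, stream);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D + kernel + D2H: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

Because both events sit outside the graph, this measures everything the graph does, including the two copies.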

I generally just use a profiler to get information about various pieces of a complex graph.

Hi Robert,

Thanks for the reply. My understanding was that one should use nvprof to measure the execution time of individual kernels or certain API calls, and use events to measure a portion of the application, for example the time between the first kernel and the fifth kernel. Right now I am interested in the latter.

Could you elaborate a bit on what you mean by using a profiler? I can sort of drag a timeline indicator between two kernels in nvvp, but that only covers one execution. I cannot do it for 100 executions and then take the median value, right?

Back to CUDA graphs: is it even possible to measure the execution time of a complex graph without the H2D and D2H memcpy time?

The methods you might want to consider would probably depend on where exactly in the graph the D2H and H2D operations are, and what exactly is meant by “execution time without H2D and D2H memcpy time”. If the D2H and H2D operations are scattered throughout the graph, I really have no idea what that means. Graphs also have inherent possibility for concurrency, so depending on the graph and concurrency scenarios of D2H/H2D with other operations, I have even less idea of what that means.

Hi Robert,

Sorry for the confusion. Here is a simplified description. My application consists of one H2D node -> kernel node A -> kernel node B -> one D2H node. The dependencies are strictly linear, so there is no memcpy/compute overlap.

They are in a graph, and I want to measure the time of Kernel A + Kernel B. How?

This might be an over-simplified example. I might have another 50 kernels with complicated dependencies between A and B. But I want to measure the time of my application without the H2D and D2H communication.

The H2D node is a memory copy operation that copies an image from host to device memory, and the D2H node does the reverse. Both use pinned memory and asynchronous transfers, if that matters.

Are you using stream capture, or are you building the graph using the graph API?

Manually, using the graph API.

For manual graph creation, I would probably insert a host node in the graph after the H2D node, and another host node in the graph before the D2H node.

I would use these host nodes to do host-based timing.
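A minimal sketch of that host-node idea for the linear H2D -> A -> B -> D2H case is shown below. It is an illustration under stated assumptions, not a tested recipe: `kernelA`, `kernelB`, the callback name `stamp`, and the use of `std::chrono::steady_clock` are all made up for this example, and error checking is omitted. The measured interval is taken on the host clock, so it likely includes some host callback dispatch latency in addition to the kernel time.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-ins for "kernel A" and "kernel B".
__global__ void kernelA(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}
__global__ void kernelB(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// Host-node callback: writes the current host time into the slot passed
// as userData. It makes no CUDA API calls.
void CUDART_CB stamp(void* userData) {
    *static_cast<Clock::time_point*>(userData) = Clock::now();
}

int main() {
    int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h = nullptr, *d = nullptr;
    cudaMallocHost(&h, bytes);  // pinned host buffer
    cudaMalloc(&d, bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    Clock::time_point t0, t1;  // filled in by the two host nodes

    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);
    cudaGraphNode_t h2d, t0Node, a, b, t1Node, d2h;

    cudaGraphAddMemcpyNode1D(&h2d, graph, nullptr, 0, d, h, bytes,
                             cudaMemcpyHostToDevice);

    // Host node after H2D: timestamp just before the first kernel.
    cudaHostNodeParams hp0 = {stamp, &t0};
    cudaGraphAddHostNode(&t0Node, graph, &h2d, 1, &hp0);

    void* args[] = {&d, &n};
    cudaKernelNodeParams kp = {};
    kp.gridDim = dim3((n + 255) / 256);
    kp.blockDim = dim3(256);
    kp.kernelParams = args;
    kp.func = (void*)kernelA;
    cudaGraphAddKernelNode(&a, graph, &t0Node, 1, &kp);
    kp.func = (void*)kernelB;
    cudaGraphAddKernelNode(&b, graph, &a, 1, &kp);

    // Host node before D2H: timestamp just after the last kernel.
    cudaHostNodeParams hp1 = {stamp, &t1};
    cudaGraphAddHostNode(&t1Node, graph, &b, 1, &hp1);

    cudaGraphAddMemcpyNode1D(&d2h, graph, &t1Node, 1, h, d, bytes,
                             cudaMemcpyDeviceToHost);

    cudaGraphExec_t exec;
    cudaGraphInstantiateWithFlags(&exec, graph, 0);
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);  // both callbacks have run after this

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("kernel A + kernel B (host clock): %.3f ms\n", ms);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

With more kernels between A and B, the second host node would need to depend on every node that must finish before the D2H copy starts.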

If I were using stream capture, I would probably insert cudaEventRecord statements in those same locations during the stream capture.

After launching the graph and completing the graph execution, I would perform an elapsed time query on the two events.

And if you want to launch your 100 graph executions back-to-back, without any intervening code, either of these methods will probably require some special handling. For the manual graph creation method, your host nodes will need to save their data away somewhere, perhaps using a vector push_back or similar.
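One way the push_back idea might look for back-to-back launches is sketched below. Everything named here (`work`, `pushStamp`, the `starts`/`stops` vectors, the launch count) is hypothetical, the memcpy nodes are left out to keep the sketch short, and error checking is omitted. It assumes all launches go into a single stream, so the callbacks run in launch order and the i-th start pairs with the i-th stop.

```cuda
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

using Clock = std::chrono::steady_clock;
using Stamps = std::vector<Clock::time_point>;

// Hypothetical stand-in for the kernels between the two host nodes.
__global__ void work(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

// Host-node callback: appends the current host time to the vector
// passed as userData. It makes no CUDA API calls.
void CUDART_CB pushStamp(void* userData) {
    static_cast<Stamps*>(userData)->push_back(Clock::now());
}

int main() {
    int n = 1 << 20;
    const int launches = 100;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    Stamps starts, stops;
    starts.reserve(launches);  // avoid reallocation inside the callbacks
    stops.reserve(launches);

    // Graph: host node (start) -> kernel -> host node (stop).
    // In the real application the H2D node would precede the first host
    // node and the D2H node would follow the second.
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);
    cudaGraphNode_t begin, kernel, end;

    cudaHostNodeParams hb = {pushStamp, &starts};
    cudaGraphAddHostNode(&begin, graph, nullptr, 0, &hb);

    void* args[] = {&d, &n};
    cudaKernelNodeParams kp = {};
    kp.func = (void*)work;
    kp.gridDim = dim3((n + 255) / 256);
    kp.blockDim = dim3(256);
    kp.kernelParams = args;
    cudaGraphAddKernelNode(&kernel, graph, &begin, 1, &kp);

    cudaHostNodeParams he = {pushStamp, &stops};
    cudaGraphAddHostNode(&end, graph, &kernel, 1, &he);

    cudaGraphExec_t exec;
    cudaGraphInstantiateWithFlags(&exec, graph, 0);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Back-to-back launches with no intervening host code.
    for (int i = 0; i < launches; ++i) cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);  // read the vectors only after this

    // Pair up the timestamps and take the median duration.
    std::vector<double> ms;
    const size_t count = std::min(starts.size(), stops.size());
    for (size_t i = 0; i < count; ++i)
        ms.push_back(
            std::chrono::duration<double, std::milli>(stops[i] - starts[i]).count());
    std::sort(ms.begin(), ms.end());
    if (!ms.empty()) {
        const size_t mid = ms.size() / 2;
        const double median =
            (ms.size() % 2) ? ms[mid] : 0.5 * (ms[mid - 1] + ms[mid]);
        printf("median over %zu launches: %.3f ms\n", ms.size(), median);
    }

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```

Reserving the vectors up front and reading them only after cudaStreamSynchronize keeps the callbacks cheap and avoids touching the data while launches are still in flight.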

Thanks Robert, that answers my question.