In my encoder, each image goes through a host-to-device copy on stream 0, then a series of N kernels on stream 1, then a device-to-host copy on stream 2. The streams are synchronized via events, so that the stream 1 kernels only execute once the host-to-device copy is complete. I schedule 40 encodes at a time, and when I get the callback after the 20th encode, I schedule another 40 (in another thread). So my workflow seems to meet the use case for CUDA Graphs. What can I gain by using a graph to capture the N kernels on stream 1?
Why not just try CUDA Graphs and find out whether it benefits your use case?
If many of the kernels have very short runtime, the use of CUDA Graphs can significantly reduce overall launch overhead, resulting in higher performance.
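For reference, capturing the N launches on one stream and replaying them could look roughly like the sketch below. It is a minimal illustration, not the encoder's actual code: the `stage` kernel, the buffer size, and the values chosen for N and the batch size are all placeholders, and it assumes CUDA 11.4 or later for `cudaGraphInstantiateWithFlags`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical stand-in for one encoder stage: adds 1 to every element.
__global__ void stage(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Capture numKernels launches once, replay them numImages times,
// and return the first element of the buffer afterwards.
float replayGraph(int n, int numKernels, int numImages)
{
    float *d_data = nullptr;
    CHECK(cudaMalloc(&d_data, n * sizeof(float)));
    CHECK(cudaMemset(d_data, 0, n * sizeof(float)));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Record the launches into a graph; nothing executes during capture.
    cudaGraph_t graph;
    CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    for (int k = 0; k < numKernels; ++k)
        stage<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    CHECK(cudaStreamEndCapture(stream, &graph));

    cudaGraphExec_t graphExec;
    CHECK(cudaGraphInstantiateWithFlags(&graphExec, graph, 0));

    // Replay: one launch call per image instead of numKernels calls.
    for (int img = 0; img < numImages; ++img)
        CHECK(cudaGraphLaunch(graphExec, stream));
    CHECK(cudaStreamSynchronize(stream));

    float first = 0.0f;
    CHECK(cudaMemcpy(&first, d_data, sizeof(float), cudaMemcpyDeviceToHost));

    CHECK(cudaGraphExecDestroy(graphExec));
    CHECK(cudaGraphDestroy(graph));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_data));
    return first;
}

int main()
{
    const int numKernels = 8;   // N kernels per image (assumed value)
    const int numImages = 40;   // one batch of encodes
    float first = replayGraph(1 << 20, numKernels, numImages);
    std::printf("data[0] = %.0f (each replay adds %d)\n", first, numKernels);
    return 0;
}
```

One thing to keep in mind: the kernel arguments, including device pointers, are fixed at capture time. If each image uses different buffers, the graph would need either a fixed pool of buffers or a parameter update between launches (for example via `cudaGraphExecKernelNodeSetParams`).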
Thanks. Kernel runtime is on the order of milliseconds. Is that considered short?
Kernel launch overhead is 3 to 5 microseconds on modern high-end systems. So if the kernel runtime is in the millisecond range (a difference of about 1000x), kernel launch overhead is pretty much irrelevant. Actually, kernel run times in the single-digit milliseconds on high-end systems are about the sweet spot for CUDA-accelerated apps, in particular with regard to the user interface. At any time, the performance difference between lowest-end and highest-end GPUs is typically around 20x, so such software tends to be reasonably responsive even on low-end hardware.
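To put rough numbers on that, take the upper figure of 5 microseconds per launch and assume a round 1 millisecond per kernel (an illustrative value, not a measurement of your encoder):

```latex
\[
\frac{t_{\text{launch}}}{t_{\text{kernel}}}
  = \frac{5\,\mu\text{s}}{1\,\text{ms}}
  = \frac{5\,\mu\text{s}}{1000\,\mu\text{s}}
  = 0.5\%
\]
```

The ratio does not change with N, since N kernels incur N launches. It is also a worst-case bound: kernel launches are asynchronous, so the host can usually enqueue the next launch while the previous kernel is still running, which hides most of that time.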
The other aspect of using CUDA Graphs is convenience (it could be summarized as “capture and replay”). Since your software appears to be already complete and tuned, this does not look like a compelling argument at this stage, but you may disagree.
As I said, one approach is to just give it a try: run some experiments and see how you like it. You may discover advantageous aspects of using CUDA Graphs that a mere thought experiment is not going to uncover.
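A quick experiment along those lines might look like the sketch below. It times the same sequence of kernels launched individually and launched as a captured graph, for a tiny and a larger problem size. The `stage` kernel, the sizes, and the counts are placeholders chosen for illustration, and `cudaGraphInstantiateWithFlags` assumes CUDA 11.4 or later. The event timing measures elapsed time on the GPU stream; host-side wall-clock timing would be another reasonable choice.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical stand-in for one encoder kernel; runtime grows with n.
__global__ void stage(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;
}

// Time `reps` calls of `enqueue` on stream `s`, in milliseconds.
template <typename F>
static float timeMs(cudaStream_t s, int reps, F enqueue)
{
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    enqueue();                          // warm-up run, not timed
    CHECK(cudaStreamSynchronize(s));
    CHECK(cudaEventRecord(start, s));
    for (int r = 0; r < reps; ++r) enqueue();
    CHECK(cudaEventRecord(stop, s));
    CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms;
}

// Compare numKernels individual launches against one graph launch.
void compare(int n, int numKernels, int reps, float *streamMs, float *graphMs)
{
    float *d = nullptr;
    CHECK(cudaMalloc(&d, n * sizeof(float)));
    CHECK(cudaMemset(d, 0, n * sizeof(float)));
    cudaStream_t s;
    CHECK(cudaStreamCreate(&s));
    const int blocks = (n + 255) / 256;

    auto launchAll = [&]() {
        for (int k = 0; k < numKernels; ++k)
            stage<<<blocks, 256, 0, s>>>(d, n);
    };

    // Capture the same launch sequence into a graph.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    CHECK(cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal));
    launchAll();
    CHECK(cudaStreamEndCapture(s, &graph));
    CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));

    *streamMs = timeMs(s, reps, launchAll);
    *graphMs = timeMs(s, reps, [&]() { CHECK(cudaGraphLaunch(exec, s)); });

    CHECK(cudaGraphExecDestroy(exec));
    CHECK(cudaGraphDestroy(graph));
    CHECK(cudaStreamDestroy(s));
    CHECK(cudaFree(d));
}

int main()
{
    const int sizes[] = {1 << 10, 1 << 24};   // tiny vs. larger kernels
    for (int n : sizes) {
        float streamMs = 0.0f, graphMs = 0.0f;
        compare(n, 8, 100, &streamMs, &graphMs);
        std::printf("n = %8d: stream launches %.3f ms, graph launches %.3f ms\n",
                    n, streamMs, graphMs);
    }
    return 0;
}
```

If the graph only helps for the tiny size, that would be consistent with launch overhead mattering for very short kernels and not for millisecond-scale ones, but the numbers on your own hardware and kernels are what count.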