Employing CUDA Graphs in a Dynamic Environment

Originally published at: Employing CUDA Graphs in a Dynamic Environment | NVIDIA Technical Blog

Many workloads can be sped up greatly by offloading compute-intensive parts onto GPUs. In CUDA terms, this is known as launching kernels. When those kernels are many and of short duration, launch overhead sometimes becomes a problem. One way of reducing that overhead is offered by CUDA Graphs. Graphs work because they combine arbitrary numbers…

Hello, thank you for sharing a very interesting blog post! Is it possible to use CUDA graphs in a while loop? I mean, I will execute the kernel multiple times until some condition is met, hence I do not know in advance how long the sequence of kernel launches in the graph should be. Currently I manage by running the loop inside a cooperative kernel with a grid sync, but it would be more convenient to separate the logic into multiple smaller kernels and avoid the grid sync.

Thank you for your comment and question. It does not matter whether the kernels that you would like to put in a CUDA graph are executed in a for loop, a while loop, or any other construct, as long as the conditions for CUDA graphs are met. That means that either the topology of the resulting graph does not change from one execution of your while loop to the next (which would allow you to use the graph update API), or the exact same graph is encountered multiple times, so that it can be retrieved from a container in which it was stored upon first encounter.
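To make the "stable topology" case concrete, here is a minimal host-side sketch using stream capture: the loop body is captured once, instantiated once, and the same executable graph is relaunched on every iteration. The kernel names (`k1`, `k2`), launch configuration, and the `done` flag are placeholders, not part of the original post.

```cuda
// Sketch: capture the loop body once, then relaunch the instantiated graph.
cudaGraph_t graph;
cudaGraphExec_t graphExec = nullptr;

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
k1<<<grid, block, 0, stream>>>(args);  // recorded into the graph, not executed
k2<<<grid, block, 0, stream>>>(args);
cudaStreamEndCapture(stream, &graph);

cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

while (!done) {
    cudaGraphLaunch(graphExec, stream);  // one launch replays all captured kernels
    cudaStreamSynchronize(stream);
    // re-evaluate the termination condition on the host
}
```

If the kernel parameters change between iterations but the topology does not, the graph update API (`cudaGraphExecUpdate`) can refresh `graphExec` from a newly captured graph without paying the instantiation cost again.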

Thank you for the reply! However, I probably did not make it clear: I have a while loop containing a set of 4 kernels that may be launched, for example, 200 or 10 or 300 times, depending on the data, and I am looking for a way to avoid the penalty of launching the kernels that many times, hence the interest in CUDA graphs.
Obviously I can create a graph with 4 kernel nodes and then launch it in the while loop, but that still leads to starting a couple hundred graphs, and I would like to make it one graph.
Still, in order to execute it I suppose I would need a graph in the form of a loop, and a way to stop execution of this graph when a given condition read from global memory is met. I have seen that CUDA 11.6 adds a cudaGraphNodeSetEnabled function; can it be used to stop execution of a graph from inside a kernel?
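For context on the API mentioned in the question, a brief host-side sketch may help. Note that `cudaGraphNodeSetEnabled` is a host function that toggles a node of an already instantiated graph before its next launch; it is not callable from device code. The variables `graphExec`, `node`, and `stream` are assumed to come from earlier graph setup and are not defined here.

```cuda
// Sketch: disabling a kernel node of an instantiated graph from the host.
// A disabled node behaves as a no-op on subsequent launches of graphExec.
unsigned int enabled = 0;  // 0 = disable, 1 = re-enable
cudaGraphNodeSetEnabled(graphExec, node, enabled);

cudaGraphLaunch(graphExec, stream);  // this launch skips the disabled node
```

Because the toggle happens on the host between launches, it cannot by itself terminate a graph mid-execution based on a value a kernel writes to global memory; the host would have to read the condition back and adjust the graph before the next launch.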