Processing DAGs using streams

I would like to process a DAG as concurrently as possible. Each node in the graph corresponds to a kernel, and a kernel cannot begin execution until all of its parent kernels have completed. My initial thought was to use streams that wait on their parents through cudaStreamWaitEvent, but I'm not sure how to scale this up to graphs with a large number of nodes. What is the most efficient way to approach such a task?
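
For concreteness, here is a minimal sketch of the event-based approach I have in mind. Everything specific in it is a placeholder I made up for illustration: the names `nodeKernel` and `runDag`, the four-node diamond graph, and the two-stream pool. It also assumes the nodes are already listed in topological order and that each node has at most two parents. Each node gets one event, the streams are reused round-robin, and a node's stream waits on its parents' events before the kernel is launched.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(1);                                               \
        }                                                               \
    } while (0)

// Placeholder node kernel (hypothetical): writes 1 plus the sum of its
// parents' outputs. A parent index of -1 means "no parent in this slot".
__global__ void nodeKernel(int* out, int node, int p0, int p1) {
    int sum = 1;
    if (p0 >= 0) sum += out[p0];
    if (p1 >= 0) sum += out[p1];
    out[node] = sum;
}

// parents[n] lists the parents of node n. Assumes nodes are numbered in
// topological order, so every parent index is smaller than n.
std::vector<int> runDag(const std::vector<std::vector<int>>& parents,
                        int numStreams) {
    const int numNodes = static_cast<int>(parents.size());

    int* dOut = nullptr;
    CUDA_CHECK(cudaMalloc(&dOut, numNodes * sizeof(int)));

    // Fixed-size stream pool: the stream count does not grow with the graph.
    std::vector<cudaStream_t> streams(numStreams);
    for (auto& s : streams) CUDA_CHECK(cudaStreamCreate(&s));

    // One completion event per node; timing is disabled to keep them cheap.
    std::vector<cudaEvent_t> done(numNodes);
    for (auto& e : done)
        CUDA_CHECK(cudaEventCreateWithFlags(&e, cudaEventDisableTiming));

    for (int n = 0; n < numNodes; ++n) {
        cudaStream_t s = streams[n % numStreams];  // round-robin assignment

        // Make this node's stream wait until every parent has finished.
        for (int p : parents[n])
            CUDA_CHECK(cudaStreamWaitEvent(s, done[p], 0));

        const int p0 = parents[n].size() > 0 ? parents[n][0] : -1;
        const int p1 = parents[n].size() > 1 ? parents[n][1] : -1;
        nodeKernel<<<1, 1, 0, s>>>(dOut, n, p0, p1);
        CUDA_CHECK(cudaGetLastError());

        // Mark this node as complete so its children can wait on it.
        CUDA_CHECK(cudaEventRecord(done[n], s));
    }

    CUDA_CHECK(cudaDeviceSynchronize());

    std::vector<int> hOut(numNodes);
    CUDA_CHECK(cudaMemcpy(hOut.data(), dOut, numNodes * sizeof(int),
                          cudaMemcpyDeviceToHost));

    for (auto& e : done) CUDA_CHECK(cudaEventDestroy(e));
    for (auto& s : streams) CUDA_CHECK(cudaStreamDestroy(s));
    CUDA_CHECK(cudaFree(dOut));
    return hOut;
}

int main() {
    // Hypothetical diamond DAG: 0 -> {1, 2} -> 3.
    const std::vector<std::vector<int>> parents = {{}, {0}, {0}, {1, 2}};
    const std::vector<int> out = runDag(parents, 2);
    for (int v : out) std::printf("%d ", v);
    std::printf("\n");
    return 0;
}
```

This can be built with something like `nvcc dag.cu -o dag`. Nodes 1 and 2 land on different streams and depend only on node 0, so they are free to overlap, while node 3 waits on both. My concern is whether this pattern of one event per node and one wait per edge still holds up when the graph has thousands of nodes.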

Thank you so much in advance!