Best way to run a DAG of heterogeneous tasks?

I have a workload running on the CPU today where one thread creates (and perhaps caches) DAGs of heterogeneous tasks and submits them to a runtime that dispatches tasks to a threadpool as their dependencies complete. I now want to support this pattern on GPUs. A few details that are probably important:

  • All task types feature ample SIMD parallelism, so the threads within a single thread block can execute any task cooperatively without issue.
  • Latency varies considerably across task types: the smallest and most common tasks take on the order of a few microseconds each, while the most expensive take several milliseconds.
  • Each layer of the DAG is homogeneous in task type (and thus latency).
  • The DAG has no conditional execution.

The most natural way to express this computation seems to be CUDA graphs, where each task is a kernel node launched as a single thread block with a healthy multiple of the warp size (e.g., 256 threads); a rough sketch follows the list of concerns below. However, I have a few concerns:

  • Will thousands of short, microsecond-scale single-block kernels get eaten alive by kernel launch overhead, even with the graph API’s claimed amortization?
  • Does the GPU running many concurrent kernels (up to 128 appears to be supported on an H100) provide performance comparable to launching one kernel with hundreds of thread blocks? It seems that while 128 concurrent kernels is roughly on par with the number of SMs on an H100, each SM would usually be scheduled at most one block, which limits the GPU’s ability to hide memory latency.
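
For concreteness, here’s roughly the per-task-node construction I have in mind, shrunk down to a four-node diamond DAG (A → {B, C} → D). The task body and buffer sizes are placeholders, and this assumes the three-argument cudaGraphInstantiate signature from CUDA 12:

```cpp
#include <cuda_runtime.h>

// Stand-in for a real task body: one thread block cooperatively
// executes one task's SIMD-parallel work.
__global__ void task_kernel(float* buf, int n) {
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        buf[i] += 1.0f;
}

int main() {
    const int n = 1024;  // placeholder task size
    float* buf;
    cudaMalloc(&buf, n * sizeof(float));
    cudaMemset(buf, 0, n * sizeof(float));

    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    // All four nodes reuse the same params here; the real DAG would
    // point each node at its own task's arguments.
    void* args[] = { &buf, (void*)&n };
    cudaKernelNodeParams p = {};
    p.func         = (void*)task_kernel;
    p.gridDim      = dim3(1);    // one thread block per task
    p.blockDim     = dim3(256);  // healthy multiple of the warp size
    p.kernelParams = args;

    cudaGraphNode_t a, b, c, d;
    cudaGraphAddKernelNode(&a, graph, nullptr, 0, &p);
    cudaGraphAddKernelNode(&b, graph, &a, 1, &p);  // B depends on A
    cudaGraphAddKernelNode(&c, graph, &a, 1, &p);  // C depends on A
    cudaGraphNode_t bc[] = { b, c };
    cudaGraphAddKernelNode(&d, graph, bc, 2, &p);  // D depends on B and C

    // Instantiate once, launch many times to amortize launch overhead.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12 signature
    cudaGraphLaunch(exec, 0);
    cudaDeviceSynchronize();

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaFree(buf);
    return 0;
}
```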

Alternatively, I can statically group multiple tasks within each layer of the DAG to create fewer, larger kernel launches, at the expense of introducing false dependencies (sketched below). Is there any guidance on best practices for partitioning fine-grained tasks using the graph API?
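
The grouped alternative I’m considering looks roughly like this (TaskDesc is a hypothetical descriptor struct I made up for the sketch):

```cpp
#include <cuda_runtime.h>

// Hypothetical per-task descriptor; the real one would carry whatever
// arguments each task type needs.
struct TaskDesc {
    const float* in;
    float*       out;
    int          n;
};

// One launch covers a whole DAG layer: each thread block grabs the
// task at blockIdx.x and executes it cooperatively.
__global__ void layer_kernel(const TaskDesc* tasks) {
    TaskDesc t = tasks[blockIdx.x];
    for (int i = threadIdx.x; i < t.n; i += blockDim.x)
        t.out[i] = t.in[i] * 2.0f;  // stand-in for the real task body
}

// Inside the graph-building loop: one kernel node per layer. The single
// edge to the previous layer's node is where the false dependencies come
// from: every task in this layer now waits on every task in the last one.
cudaGraphNode_t add_layer_node(cudaGraph_t graph, cudaGraphNode_t prev,
                               TaskDesc* d_tasks, int num_tasks) {
    void* args[] = { &d_tasks };
    cudaKernelNodeParams p = {};
    p.func         = (void*)layer_kernel;
    p.gridDim      = dim3(num_tasks);  // one block per task in the layer
    p.blockDim     = dim3(256);
    p.kernelParams = args;
    cudaGraphNode_t node;
    cudaGraphAddKernelNode(&node, graph,
                           prev ? &prev : nullptr, prev ? 1 : 0, &p);
    return node;
}
```

This trades graph-node count for occupancy (hundreds of blocks per launch instead of one), but the per-layer barrier means a slow task in one layer stalls every task in the next.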