Best way to run a DAG of heterogeneous tasks?

I have a workload running on the CPU today where one thread creates (and perhaps caches) DAGs of heterogeneous tasks and submits them to a runtime that dispatches tasks to a threadpool as their dependencies complete. I now want to support this pattern on GPUs. A few details that are probably important:

  • All task types feature ample SIMD parallelism, so the threads within a single thread block can execute any task cooperatively without issue.
  • Latency varies considerably across task types: the smallest and most common tasks take on the order of a few microseconds each, while the most expensive take several milliseconds.
  • Each layer of the DAG is homogeneous in task type (and thus latency).
  • The DAG has no conditional execution.

The most natural way to express this computation seems to be CUDA graphs, where each task is a kernel node launched as a single thread block with a healthy multiple of the warp size (e.g., 256 threads); a rough sketch follows the list of concerns below. However, I have a few concerns:

  • Will thousands of short, microsecond-scale single-block kernels get eaten alive by kernel launch overhead, even with the graph API’s claimed amortization?
  • Does the GPU running many concurrent kernels (up to 128 appears to be supported on an H100) provide performance comparable to launching one kernel with hundreds of thread blocks? It seems that while 128 concurrent kernels is roughly on par with the number of SMs on an H100, each SM would usually be scheduled at most one block, which limits the GPU’s ability to hide memory latency.
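
For concreteness, here’s roughly the per-task-node construction I have in mind, shrunk down to a four-node diamond DAG (A → {B, C} → D). The task body and buffer sizes are placeholders, and this assumes the three-argument cudaGraphInstantiate signature from CUDA 12:

```cpp
#include <cuda_runtime.h>

// Stand-in for a real task body: one thread block cooperatively
// executes one task's SIMD-parallel work.
__global__ void task_kernel(float* buf, int n) {
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        buf[i] += 1.0f;
}

int main() {
    const int n = 1024;  // placeholder task size
    float* buf;
    cudaMalloc(&buf, n * sizeof(float));
    cudaMemset(buf, 0, n * sizeof(float));

    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    // All four nodes reuse the same params here; the real DAG would
    // point each node at its own task's arguments.
    void* args[] = { &buf, (void*)&n };
    cudaKernelNodeParams p = {};
    p.func         = (void*)task_kernel;
    p.gridDim      = dim3(1);    // one thread block per task
    p.blockDim     = dim3(256);  // healthy multiple of the warp size
    p.kernelParams = args;

    cudaGraphNode_t a, b, c, d;
    cudaGraphAddKernelNode(&a, graph, nullptr, 0, &p);
    cudaGraphAddKernelNode(&b, graph, &a, 1, &p);  // B depends on A
    cudaGraphAddKernelNode(&c, graph, &a, 1, &p);  // C depends on A
    cudaGraphNode_t bc[] = { b, c };
    cudaGraphAddKernelNode(&d, graph, bc, 2, &p);  // D depends on B and C

    // Instantiate once, launch many times to amortize launch overhead.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12 signature
    cudaGraphLaunch(exec, 0);
    cudaDeviceSynchronize();

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaFree(buf);
    return 0;
}
```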

Alternatively, I can statically group multiple tasks within each layer of the DAG to create fewer, larger kernel launches, at the expense of introducing false dependencies (sketched below). Is there any guidance on best practices for partitioning fine-grained tasks using the graph API?
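
The grouped alternative I’m considering looks roughly like this (TaskDesc is a hypothetical descriptor struct I made up for the sketch):

```cpp
#include <cuda_runtime.h>

// Hypothetical per-task descriptor; the real one would carry whatever
// arguments each task type needs.
struct TaskDesc {
    const float* in;
    float*       out;
    int          n;
};

// One launch covers a whole DAG layer: each thread block grabs the
// task at blockIdx.x and executes it cooperatively.
__global__ void layer_kernel(const TaskDesc* tasks) {
    TaskDesc t = tasks[blockIdx.x];
    for (int i = threadIdx.x; i < t.n; i += blockDim.x)
        t.out[i] = t.in[i] * 2.0f;  // stand-in for the real task body
}

// Inside the graph-building loop: one kernel node per layer. The single
// edge to the previous layer's node is where the false dependencies come
// from: every task in this layer now waits on every task in the last one.
cudaGraphNode_t add_layer_node(cudaGraph_t graph, cudaGraphNode_t prev,
                               TaskDesc* d_tasks, int num_tasks) {
    void* args[] = { &d_tasks };
    cudaKernelNodeParams p = {};
    p.func         = (void*)layer_kernel;
    p.gridDim      = dim3(num_tasks);  // one block per task in the layer
    p.blockDim     = dim3(256);
    p.kernelParams = args;
    cudaGraphNode_t node;
    cudaGraphAddKernelNode(&node, graph,
                           prev ? &prev : nullptr, prev ? 1 : 0, &p);
    return node;
}
```

This trades graph-node count for occupancy (hundreds of blocks per launch instead of one), but the per-layer barrier means a slow task in one layer stalls every task in the next.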