Complex scheduling of thousands of tasks

This is a question about both OpenCL and CUDA.

Both APIs provide building blocks for scheduling asynchronous execution of tasks, with event-based dependencies and so on. (CUDA even lets you create dataflow graphs and schedule them as such, but let’s ignore that for the sake of this discussion.)

However, suppose you need to schedule many thousands of tasks, both execution and copying, and many of them reuse the same buffers. Several different threads may each want to schedule work, and you have to decide how many threads to perform the scheduling on, how to estimate execution times in order to optimize the schedule, and so on. At that point you start thinking about a somewhat higher-level framework or library. I know for a fact that different organizations end up developing such a mechanism to serve their own specific set of needs. But are there popular FOSS solutions for doing this? Perhaps in the HPC space, with which I’m less familiar?
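To make the requirement concrete, here is a minimal sketch of the kind of layer meant above, written in plain Python rather than against CUDA or OpenCL. Everything in it is hypothetical and for illustration only: the `BufferScheduler` class, its `submit` method and the buffer names are made up, and the "tasks" are ordinary callables standing in for kernel launches and copies. It only shows one piece of the problem, namely inferring ordering from which buffers each task reads and writes while letting several threads submit work; it does no cost estimation and is not how any particular framework works.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


class BufferScheduler:
    """Toy scheduler (hypothetical): orders tasks by the buffers they touch."""

    def __init__(self, max_workers=4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()  # lets several threads call submit() safely
        self._last_writer = {}         # buffer name -> future of its latest writer
        self._readers = {}             # buffer name -> futures reading since that write

    def submit(self, fn, reads=(), writes=()):
        """Queue fn; it runs only after conflicting earlier tasks have finished."""
        with self._lock:
            deps = []
            for buf in reads:
                # Read-after-write: wait for the latest writer of this buffer.
                if buf in self._last_writer:
                    deps.append(self._last_writer[buf])
            for buf in writes:
                # Write-after-write and write-after-read hazards.
                if buf in self._last_writer:
                    deps.append(self._last_writer[buf])
                deps.extend(self._readers.get(buf, []))
            future = self._pool.submit(self._run_after, deps, fn)
            for buf in writes:
                self._last_writer[buf] = future
                self._readers[buf] = []
            for buf in reads:
                if buf not in writes:
                    self._readers.setdefault(buf, []).append(future)
            return future

    @staticmethod
    def _run_after(deps, fn):
        # Blocking a worker on its dependencies is wasteful but keeps the sketch
        # short. It cannot deadlock here because the pool starts tasks in
        # submission order and every dependency was submitted earlier.
        for dep in deps:
            dep.result()
        return fn()


if __name__ == "__main__":
    log = []
    sched = BufferScheduler(max_workers=4)
    futures = [
        sched.submit(lambda: log.append("upload A"), writes=["A"]),
        sched.submit(lambda: log.append("kernel A->B"), reads=["A"], writes=["B"]),
        sched.submit(lambda: log.append("download B"), reads=["B"]),
        sched.submit(lambda: log.append("overwrite A"), writes=["A"]),
    ]
    for f in futures:
        f.result()
    print(log)
```

In this usage example "download B" and "overwrite A" have no conflict with each other, so they may complete in either order; only the orderings implied by shared buffers are enforced. A real solution would replace the blocking waits with the event mechanisms of the underlying API and add the cost model and buffer management that this sketch leaves out.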

There is Legion, which NVIDIA has contributed to (and here). This overview may be of interest. Legate is a NumPy-like environment that can take advantage of Legion for large-scale tasks.