Easiest way to optimally execute arbitrary DAG of functions on a single GPU?

Given a small list of functions, an arbitrary DAG describing when to call each function and where its outputs go, and a small initial data set, is there any library where I could just load all of that (formatted to some specification, obviously) into GPU RAM and run the whole thing in a reasonably optimal manner? If not, is there a simple way to build one that I'm missing?

You could of course run something like this on a CPU that just calls the GPU for each function, but I don't want the overhead of constantly shuttling data on and off the GPU, both because the functions I'm working with are very small and because I'm concerned about how well independent functions in the DAG would be parallelized across cores.

That's a pretty vague description. Do you have a reference implementation on the CPU? A lightweight solution could be based on function pointers, but this offers limited flexibility and generality. An approach with more overhead but greater flexibility and generality could be based on run-time compilation:

NVRTC (Runtime Compilation) :: CUDA Toolkit Documentation
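To make the lighter-weight option concrete, here is a minimal sketch of device-side dispatch through a function-pointer table. Everything in it (OpFn, op_add, op_mul, op_table, apply_ops) is invented for illustration; the real entries would be whatever small functions the DAG calls.

```cpp
// Sketch: dispatch through a table of device function pointers.
// All names here are hypothetical, not from any library.
typedef float (*OpFn)(float, float);

__device__ float op_add(float a, float b) { return a + b; }
__device__ float op_mul(float a, float b) { return a * b; }

// Device-side table, indexed by an opcode stored alongside the data.
__device__ OpFn op_table[] = { op_add, op_mul };

__global__ void apply_ops(const int *opcode, const float *a,
                          const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = op_table[opcode[i]](a[i], b[i]);
}
```

The trade-off mentioned above: indirect calls like this prevent inlining and limit what the compiler can optimize, whereas NVRTC (linked above) lets you generate and compile specialized kernels at run time.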

Here's an example, for CPU clusters, of what I'm looking for: Custom Graphs — Dask documentation. In this case, imagine each CPU in the cluster replaced by a GPU core.

In essence, I have a small number of basic functions. They're organized into a DAG that indicates the general order in which the functions have to be called, which functions must complete before which others, where the outputs and inputs go, and so on. The DAG starts by taking in a small amount of initial data. I'd like a little task-scheduling program running on the GPU that reads the DAG and has cores execute the functions accordingly.
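For concreteness, one way a node of such a DAG might be encoded for GPU-side traversal (the layout and field names here are just my illustration, not any existing format):

```cpp
// Hypothetical flat encoding of one DAG node for GPU-side traversal.
struct DagNode {
    int opcode;          // which of the small functions to run
    int num_deps;        // how many parent nodes must finish first
    int first_child;     // offset into a flat children[] index array
    int num_children;    // how many downstream nodes to notify
    int input_slots[2];  // positions of the inputs in a shared data buffer
    int output_slot;     // position where the result is written
};
```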

Many of the functions I'm interested in running would take less than a microsecond on a core, while the launch overhead for even an empty kernel is around 3 microseconds, so I can't just have the CPU issue a large number of kernel calls.

The overhead for a null kernel is more like 5 usec. Depending on what these functions are, you may be able to set up a kernel that traverses the DAG, where your mystery functions are simply device functions. It is also possible that your task is not suitable for GPU acceleration because it does not expose enough parallelism.

“Depending on what these functions are, you may be able to set up a kernel that traverses the DAG, where your mystery functions are simply device functions.”

Could you please elaborate on this a little?
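A minimal sketch of what that suggestion could look like, assuming the hypothetical DagNode encoding from earlier in the thread. This is illustrative only, not a library API: a single persistent kernel is launched once, threads pull ready nodes from a work queue, and finishing a node decrements its children's dependency counters.

```cpp
// Hypothetical persistent-kernel DAG executor; everything here is a sketch.
// Host-side setup (not shown): dep_count[i] = number of parents of node i;
// ready_queue pre-filled with the indices of nodes that have no parents and
// -1 everywhere else; *queue_tail = count of those initial nodes;
// *next_ticket = 0. The grid must be small enough that all blocks are
// co-resident, the usual persistent-kernel constraint.

struct DagNode {                 // same layout as sketched earlier
    int opcode, num_deps, first_child, num_children;
    int input_slots[2], output_slot;
};

__device__ float run_op(int opcode, float a, float b)
{
    switch (opcode) {            // the "mystery functions" as device functions
    case 0:  return a + b;
    case 1:  return a * b;
    default: return 0.0f;
    }
}

__global__ void dag_executor(const DagNode *nodes, const int *children,
                             int *dep_count, int *ready_queue,
                             int *queue_tail, int *next_ticket,
                             float *slots, int num_nodes)
{
    for (;;) {
        // Each ticket corresponds to exactly one node execution.
        int ticket = atomicAdd(next_ticket, 1);
        if (ticket >= num_nodes) return;     // all nodes claimed; done

        // Spin until a finished parent publishes a node into this slot.
        int node_id;
        do {
            node_id = ((volatile int *)ready_queue)[ticket];
        } while (node_id < 0);
        __threadfence();                     // make parents' outputs visible

        DagNode n = nodes[node_id];
        slots[n.output_slot] = run_op(n.opcode, slots[n.input_slots[0]],
                                      slots[n.input_slots[1]]);
        __threadfence();                     // publish before notifying

        // The last parent to finish pushes each child onto the queue.
        for (int c = 0; c < n.num_children; ++c) {
            int child = children[n.first_child + c];
            if (atomicSub(&dep_count[child], 1) == 1) {
                int slot = atomicAdd(queue_tail, 1);
                atomicExch(&ready_queue[slot], child);
            }
        }
    }
}
```

The point of the single launch is that the 3-5 usec per-kernel overhead is paid once for the whole DAG rather than once per function. The spin/fence handshake above is a crude stand-in for proper device-scope atomics, and a real implementation would map warps or blocks (rather than single threads) to nodes that expose inner parallelism.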