Any simple wait-lock queuing library for heterogeneous CUDA systems?

Curious little problem that may be something few folks have, but before I try to solve it I wanted to know what was out there.

I have a “CUDA sandbox” machine with an integrated GPU (SM 1.1), a GTX 295 (2× SM 1.3) and a Tesla (SM 1.3, with lots of memory). I also use a build system (SCons) that drives CUDA through Python-wrapped C/C++.

What I’d really like to be able to do is run “scons -uj2”, meaning two processes can run at once. I’d like the first of those processes to be able to say “I want an SM 1.3 card with 800 MB of memory” and get half of the 295, then have the second one come along and automatically get the other half of the 295 that isn’t in use. If I ran -uj3, the third process would get the Tesla, but with -uj4 the fourth process would say “there aren’t any cards available; I’ll sleep and check every so often to see when one opens up.”

So:

  • Set min requirements for the card you want (SM and memory for now)
  • Get the first card not in use by another app matching those requirements (or the “least impressive” card matching those requirements)
  • If no cards matching min requirements are available, either return with an error (so the app can exit or wait) or just wait for a card to be available, sleeping nicely until it does
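To make that concrete, here’s roughly the interface I’m imagining for the three bullets above (all of these names are made up by me to illustrate the idea; as far as I know nothing like this exists yet):

    /* Made-up interface, just to illustrate the requirements above. */
    #include <stddef.h>

    typedef struct {
        int    sm_major;   /* minimum compute capability, e.g. 1 and 3 for SM 1.3 */
        int    sm_minor;
        size_t min_mem;    /* minimum device memory in bytes */
    } gpu_request;

    /* Grab the first (or “least impressive”) free card matching req;
     * return its CUDA device ordinal, or -1 if nothing suitable is free. */
    int gpu_try_acquire(const gpu_request *req);

    /* Same, but sleep politely until a matching card frees up. */
    int gpu_acquire_wait(const gpu_request *req);

    /* Give the card back so another process can use it. */
    void gpu_release(int device);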

I can see how to do most of it except the “first card not in use by another app” part… maybe that could be done with nvidia-smi, which I’ve seen referenced (but I don’t seem to have it). It seemed a simple enough thing that I figured someone has probably solved it by now with a nice library (or just a scrap of code). Any suggestions? I don’t want a full queueing system; in a sense this is a single multithreaded application with different threads on different cards (hopefully that works OK), and it doesn’t fit a cluster-style queuing system. Just something simple.

Thanks!

I’m working on something just like this: a little runtime wrapper that lets you execute a kernel on the “first available” device. It’s just something I’ve been doing in my spare time as part of another app I’m working on. Maybe when I get it to release quality I can release it…

You might start with some of the nice convenience functions in GPUWorker. It’s not exactly what you want but it’s a nice start.

You might want to take a look at these projects:

  1. http://runtime.bordeaux.inria.fr/StarPU/
  2. http://code.google.com/p/harmonyruntime/

They are both runtimes for heterogeneous systems that allow you to asynchronously queue up a bunch of kernels and have a runtime component map them to the “best” core in your system. Both runtimes support a fat binary that contains both a CPU and a GPU implementation of a kernel; the runtime decides whether to run a given kernel on a host core or the GPU based on some predictive model.

StarPU allows you to define your program as an acyclic data-flow graph of kernels, while Harmony supports control-flow and cycles.
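To give a flavour of the StarPU side, a “codelet” carrying both a CPU and a CUDA implementation looks roughly like this; this is from memory of the StarPU documentation, so the field and function names may not exactly match the version linked above:

    #include <starpu.h>
    #include <stdint.h>

    /* Plain C implementation of the kernel. */
    void scal_cpu(void *buffers[], void *cl_arg)
    {
        struct starpu_vector_interface *v = buffers[0];
        float *ptr = (float *)STARPU_VECTOR_GET_PTR(v);
        unsigned n = STARPU_VECTOR_GET_NX(v);
        for (unsigned i = 0; i < n; i++)
            ptr[i] *= 2.0f;
    }

    /* CUDA implementation (a host function, defined elsewhere, that
     * launches the actual kernel). */
    extern void scal_cuda(void *buffers[], void *cl_arg);

    /* One codelet carrying both implementations; the StarPU scheduler
     * decides at run time whether each task goes to a CPU core or a GPU. */
    static struct starpu_codelet cl = {
        .cpu_funcs  = { scal_cpu },
        .cuda_funcs = { scal_cuda },
        .nbuffers   = 1,
        .modes      = { STARPU_RW },
    };

    int main(void)
    {
        float data[1024];
        starpu_data_handle_t handle;

        starpu_init(NULL);
        starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                    (uintptr_t)data, 1024, sizeof(float));

        struct starpu_task *task = starpu_task_create();
        task->cl = &cl;
        task->handles[0] = handle;
        starpu_task_submit(task);          /* asynchronous */

        starpu_task_wait_for_all();
        starpu_data_unregister(handle);
        starpu_shutdown();
        return 0;
    }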

I’m not sure about StarPU, but writing code for Harmony is fairly difficult at the moment as there is no front-end to the runtime. You have to manually create kernel objects and point them to CPU and GPU implementations of kernels. I am working on a CUDA front-end to Harmony, but that is at least 6–10 months out.

OK, I’m going to be a complete minimalist and point out that you can do all of this within the CUDA runtime API (as long as you run on Linux).

Call cudaGetDeviceProperties on all devices and supply a list of the valid ones to cudaSetValidDevices.

Set all cards to compute-exclusive mode (a system setup step that can only be done on Linux).

Then you simply do not call cudaSetDevice in the host threads that access CUDA. The automatic context initialization on the first cuda* call will choose the first free GPU from your specified valid devices.

Check the error code from that first cuda* call. If it returns the “no CUDA devices are available” error, then you know that all of the valid GPUs are in use by other apps (or by other threads within your app).

Note that this requires CUDA 2.2 or newer.
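Roughly, the whole thing boils down to something like this (a sketch only; the function name is mine, and the exact error code you get back when every valid card is busy has varied between CUDA versions, so I’d just treat anything other than cudaSuccess on that first call as “no free card”):

    #include <cuda_runtime.h>
    #include <stddef.h>

    /* Rough sketch: grab any free SM >= 1.3 card with at least min_mem
     * bytes of memory. Assumes every card has already been put in
     * compute-exclusive mode (one-time system setup, Linux only). */
    int acquire_gpu(size_t min_mem)
    {
        int count = 0, nvalid = 0, valid[16];

        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count && nvalid < 16; ++dev) {
            struct cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            if ((prop.major > 1 || (prop.major == 1 && prop.minor >= 3)) &&
                prop.totalGlobalMem >= min_mem)
                valid[nvalid++] = dev;
        }
        if (nvalid == 0)
            return -1;                       /* no card meets the requirements */

        cudaSetValidDevices(valid, nvalid);  /* CUDA 2.2+ */

        /* Deliberately no cudaSetDevice here: the first cuda* call creates
         * the context on the first free device from the valid list, and
         * fails if every one of them is already in use. */
        if (cudaFree(0) != cudaSuccess)
            return -1;                       /* all matching cards are busy */

        int dev = -1;
        cudaGetDevice(&dev);
        return dev;
    }

The caller can then either exit when it gets -1 back or sleep for a while and try again, which gives you the “wait nicely” behaviour you described (I’d double-check on your CUDA version that a failed attempt can simply be retried in the same process).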