How to queue CUDA tasks to a single or multiple GPU system

I’d like to design a CUDA processing system that can handle one or more threads from a single node (context) and dispatch them to one or more GPU devices.

Each thread (i.e. a task) uses a single GPU for its CUDA calculations. Each GPU is a resource that must be reserved for a single task at any given time.

Currently, multiple threads from a single process compete for a single GPU resource, which causes an “unspecified launch failure” error message.

I’d like to queue each task for an unoccupied GPU device, or have it wait until a device is ready to accept a new task.

I’m quite certain that I need a manager that decides which task may access which device, puts tasks on hold, schedules work, etc.
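To make the idea of such a manager concrete, here is a minimal sketch in standard C++ (no CUDA calls, so it compiles anywhere). The names `GpuPool` and `GpuLease` are hypothetical, not part of any CUDA library. It assumes each task is a host thread that asks the pool for a device ID and blocks until one is unoccupied.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Hypothetical manager: hands out GPU device IDs, one task per device.
class GpuPool {
public:
    explicit GpuPool(int device_count) {
        for (int id = 0; id < device_count; ++id) free_.push(id);
    }

    // Blocks until some device is unoccupied, then reserves it.
    int acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        available_.wait(lock, [this] { return !free_.empty(); });
        int id = free_.front();
        free_.pop();
        return id;
    }

    // Returns a device to the pool and wakes one waiting task.
    void release(int id) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            free_.push(id);
        }
        available_.notify_one();
    }

private:
    std::mutex mutex_;
    std::condition_variable available_;
    std::queue<int> free_;  // IDs of devices that are currently unoccupied
};

// RAII lease: the device goes back to the pool when the lease is destroyed,
// even if the task exits early through an exception.
class GpuLease {
public:
    explicit GpuLease(GpuPool& pool) : pool_(pool), id_(pool.acquire()) {}
    ~GpuLease() { pool_.release(id_); }
    GpuLease(const GpuLease&) = delete;
    GpuLease& operator=(const GpuLease&) = delete;
    int device() const { return id_; }

private:
    GpuPool& pool_;
    int id_;
};
```

In a real task, the thread would presumably construct a `GpuLease`, call `cudaSetDevice(lease.device())`, launch its kernels, and call `cudaDeviceSynchronize()` before the lease goes out of scope, so that the device is really idle when the next task receives it. The pool size could come from `cudaGetDeviceCount`.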

Here are my questions:

What is the best way to design such a system?
Should I integrate my application with a third-party scheduling mechanism?
Is there a supported library for this case?
Is there a related CUDA product for this case? PhysX has a CUDA scheduler.
Will the design also support a future multi-context scheduling system using the same methodology?

Related posts:

I noticed the TORQUE Resource Manager for multi-process (context) scheduling in the following post. That doesn’t seem to be the case I’m looking for.

Likewise, I have noticed some related products, such as ArrayFire multi-GPU.

I’m rather confused and don’t really know where to start. I’m pretty sure that Nvidia has already encountered this problem. Any help would be much appreciated.

Hello,

On our computing servers we use a scheduler. There are many users; jobs are submitted with very simple scripts and run in order of submission. Here is an example of how it works, from the first hit on Google:

https://www.osc.edu/supercomputing/batch-processing-at-osc/job-submission

The root user (admin) can also change the priority of jobs. It should work just as well on a single computer if you want to submit many jobs and run them one after another.

http://pubs.opengroup.org/onlinepubs/009696799/utilities/qsub.html

It can be configured to run all the possibilities you mentioned.

Edit: I realized this is exactly like TORQUE; sorry for the useless text.