Configuring a workload manager on a cluster with NVIDIA Tesla S1070

Hi guys,

I am really confused.

What we have: each node in the cluster has 4 GPUs (NVIDIA Tesla S1070) and 16 cores (4 quad-core CPUs).
What we want: separate queues for (1) pure CPU jobs and (2) mixed GPU+CPU jobs.
Case (1) means that all jobs use only CPUs on a node.
Case (2) means that all jobs use GPUs + CPUs on a node.

How can we organise this with PBS or Torque?
For CPU jobs we must create a queue which has only 12 CPU cores per node (the other 4 cores are used to drive the GPUs).
For GPU jobs we must create a queue which has 4 CPU cores per node.

Is it necessary that 1 GPU requires 1 CPU core to function, or can we use 4 virtual processors (VPs) on 2 cores to drive the 4 GPUs and keep the remaining 14 cores for CPU jobs?

Does anyone have an idea about configuring a workload manager on clusters that have a CPU+GPU combination for computing batch jobs?
I searched hard on the Internet but couldn’t find any useful information.
Any suggestions? Thanks in advance.

We use Sun Grid Engine rather than PBS/Torque, but our approach has been to treat the GPU as a consumable resource and just use a single job queue which allocates CPU cores in the standard way you would for MPI or OpenMP jobs. That way GPU jobs just become a subset of CPU jobs and run on the same queue. You can use the same basic scheduling templates that you would use for managing floating software licenses, but have the GPU as a per-node resource rather than a global resource. When a node has a free GPU resource and a free CPU core, it will accept a new GPU job if one is queued. When it has only a free CPU core, it will accept new CPU jobs, and GPU jobs will sit in the job queue until a node with both a free CPU core and a free GPU resource becomes available. When there are no GPU jobs on the queue, nodes just process CPU jobs as if the GPUs don’t exist.
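To make the consumable-resource idea concrete, here is a minimal Python sketch of the placement logic described above. It is not Sun Grid Engine configuration and not how any real scheduler is implemented; the names `Node`, `Job`, `place` and `dispatch` are made up for illustration, and the node sizes are arbitrary.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_cores: int
    free_gpus: int  # per-node consumable resource, counted like a license


@dataclass
class Job:
    name: str
    cores: int = 1
    gpus: int = 0  # 0 means a pure CPU job


def place(job: Job, nodes: List[Node]) -> Optional[str]:
    """Start the job on the first node that can satisfy both requests."""
    for node in nodes:
        if node.free_cores >= job.cores and node.free_gpus >= job.gpus:
            node.free_cores -= job.cores
            node.free_gpus -= job.gpus
            return node.name
    return None  # nothing fits, the job stays queued


def dispatch(queue: List[Job], nodes: List[Node]) -> List[Job]:
    """Try every queued job once, in order; return the jobs still waiting."""
    waiting = []
    for job in queue:
        node = place(job, nodes)
        if node is None:
            waiting.append(job)
        else:
            print(f"{job.name} -> {node}")
    return waiting


if __name__ == "__main__":
    # Hypothetical node with two free cores and one free GPU.
    nodes = [Node("node01", free_cores=2, free_gpus=1)]
    queue = [Job("gpu-a", gpus=1), Job("gpu-b", gpus=1), Job("cpu-a")]
    left = dispatch(queue, nodes)
    print("still queued:", [j.name for j in left])
```

In this toy run the second GPU job waits because the only GPU is taken, while the CPU job behind it still starts on the remaining core, which is the behaviour described for the single shared queue.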

Thank you, Mr. avidday, that cleared up half of my confusion.

Could you please help me with the following?

I have also read on the Internet that some sort of soft locks have to be set on the GPUs to prevent misuse by users in a multi-user environment,

and that workload managers like PBS/Torque don’t take care of allocating GPUs.

If we run two simultaneous CUDA jobs, will they be executed on two different GPUs, or will they overlap on the same one?

Is there any provision to set the device number of the GPU on which my job has to run?

Thanks in advance…

We don’t have that problem because we presently only have a single GPU per node, so that isn’t something I have personal experience with. Somebody with a cluster of S1070s will have to help you there (or Massimo Fatica, who works for NVIDIA and posts here; he seems to be their compute cluster guru).

Having said that, I understand that the nvidia-smi utility has the ability to configure an S870/S1070 so that the cards go into “compute exclusive” mode, where the driver will only permit a single process per physical GPU. If a user tries to run on a GPU which is already in use while it is in compute exclusive mode, the program will fail to launch. That, combined with your scheduler and a bit of user discipline, should probably work in most circumstances.
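For the soft-lock and device-number questions raised earlier in the thread, one possible approach is a small wrapper that claims a GPU index through a lock file and then exposes only that GPU to the job. The Python sketch below is an illustration under assumptions, not something any poster here confirmed: the function names and the lock-file layout are hypothetical, and it assumes the installed CUDA runtime honours the `CUDA_VISIBLE_DEVICES` environment variable (otherwise the program itself would have to call `cudaSetDevice` with the claimed index).

```python
import os
import tempfile
from typing import Dict, Optional


def claim_gpu(lock_dir: str, num_gpus: int) -> Optional[int]:
    """Claim the first GPU index whose lock file does not exist yet."""
    for index in range(num_gpus):
        path = os.path.join(lock_dir, f"gpu{index}.lock")
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so two
            # jobs cannot both claim the same index.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))  # record the owner for debugging
        return index
    return None  # every GPU on this node is already claimed


def release_gpu(lock_dir: str, index: int) -> None:
    """Remove the lock file so that another job can claim the GPU."""
    os.remove(os.path.join(lock_dir, f"gpu{index}.lock"))


def job_environment(index: int) -> Dict[str, str]:
    """Copy of the current environment that exposes only the claimed GPU."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(index)  # assumption: runtime supports it
    return env


if __name__ == "__main__":
    # A temporary directory stands in for a node-local lock directory.
    with tempfile.TemporaryDirectory() as lock_dir:
        first = claim_gpu(lock_dir, num_gpus=4)
        second = claim_gpu(lock_dir, num_gpus=4)
        print(first, second)
        release_gpu(lock_dir, first)
        print(claim_gpu(lock_dir, num_gpus=4))
```

A job script could then pass `job_environment(index)` as the `env` argument of `subprocess.run` when starting the CUDA program. Lock files like this only help if every job goes through the wrapper, so they complement compute exclusive mode rather than replace it.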

That solved most of my problems, Mr. avidday. Thank you.

I searched the Internet for the GPU compute exclusive mode you mentioned and found this link:

https://www.wiki.ed.ac.uk/display/ecdfwiki/…-Exclusive+Mode

which is really helpful…

Mr. avidday, you made my day.