I’m currently managing a small cluster of computers with the Torque/Maui resource manager. A subset of the machines in my cluster have GPUs: some have only one card, whereas others have several. What I would like to achieve is to configure Torque so that jobs requiring a GPU are launched only on machines that have one, and so that no more than one job is launched per card.
On the CUDA side, I’ve set all my cards to the compute mode that allows only one task at a time. This works fine.
Now, the problem comes with the configuration of Torque/Maui. Following the documentation, the solution tried so far is to use the GRES property for the computers possessing GPUs, much like node-locked licenses. When a user wants to run a GPU job, he submits it to the “GPU-queue” (the list of computers possessing GPUs) and asks for as many cards as the job requires with the option “-W x=GRES:gpu@4” (without quotes) for 4 GPUs. (http://www.clusterresources.com/products/torque/docs/2.1jobsubmission.shtml#gres)
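To make the submission concrete, here is a minimal Python sketch that only builds and prints the command line; it does not call qsub. The queue name "gpu" and the script name "my_gpu_job.sh" are placeholders chosen for illustration, not the real names on the cluster.

```python
import shlex


def build_qsub_command(ngpus, queue, script):
    """Build the qsub argument list for a job requesting `ngpus` GPUs via GRES."""
    if ngpus < 1:
        raise ValueError("a GPU job must request at least one card")
    # -q selects the queue; -W x=GRES:gpu@N is the option quoted in this post.
    return ["qsub", "-q", queue, "-W", "x=GRES:gpu@%d" % ngpus, script]


# Placeholder queue and script names, for illustration only.
command = build_qsub_command(4, "gpu", "my_gpu_job.sh")
print(" ".join(shlex.quote(arg) for arg in command))
```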
The problem is that this doesn’t work. The job is properly assigned to a node possessing GPUs, but the number of available GPUs on that node is not decremented. As a result, when another job is launched, the same node is used again, and the job crashes because the cards do not accept more than one job.
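To spell out the expected behaviour, here is a toy Python model of the per-node accounting the scheduler should be doing. It is not Torque/Maui code, and the node names and GPU counts are invented for illustration.

```python
# Toy model of per-node GPU accounting (not Torque/Maui code).
# Node names and GPU counts are invented for illustration.
free_gpus = {"node01": 1, "node02": 4}


def place_job(free, gpus_needed):
    """Reserve GPUs on the first node with enough free cards; return its name or None."""
    for node, available in free.items():
        if available >= gpus_needed:
            # This decrement is the step that does not seem to happen.
            free[node] = available - gpus_needed
            return node
    return None  # no node has enough free cards: the job should stay queued


first = place_job(free_gpus, 4)   # expected to land on the 4-GPU node
second = place_job(free_gpus, 4)  # expected to wait, since no cards are left
print(first, second, free_gpus)
```

With the current configuration, the second job is instead sent to the same node as the first one.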
Is this the correct way to configure Torque/Maui? If not, could you point me to some good documentation?
Thanks in advance for any help you can offer.