PyTorch with Slurm and MPS work-around --gres=gpu:1

We're running Slurm 20.02 on CentOS 7.7 with Bright Cluster 8.2. I'm trying to understand how the sbatch file below ends up sharing a GPU.

MPS is running on the head node:

ps -auwx|grep mps
root     108581  0.0  0.0  12780   812 ?        Ssl  Mar23   0:27 /cm/local/apps/cuda-driver/libs/440.33.01/bin/nvidia-cuda-mps-control -d
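
For what it's worth, the control daemon can also be queried directly; a quick sanity check (not part of the original debugging session) is to ask it for its server list, which prints the PIDs of any active MPS servers:

echo get_server_list | nvidia-cuda-mps-control    # empty output means no clients have attached yet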

The entire script is posted on Stack Overflow.

Here are the sbatch file contents:

#!/bin/sh
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --job-name=sequentialBlur_alexnet_training_imagewoof_crossval
#SBATCH --nodelist=node003
module purge
module load gcc5 cuda10.1
module load openmpi/cuda/64
module load pytorch-py36-cuda10.1-gcc
module load ml-pythondeps-py36-cuda10.1-gcc
python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof $1 | tee alex_100_imwoof_seq_longtrain_cv_$1.txt
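
For reference, requesting the GPU explicitly would just mean adding one more directive to that header block; a minimal sketch, assuming GresTypes=gpu is set in slurm.conf and node003 has a matching gres.conf entry:

#SBATCH --gres=gpu:1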

From nvidia-smi on the compute node:

        Processes
            Process ID              : 320467
                Type                : C
                Name                : python3.6
                Used GPU Memory     : 2369 MiB
            Process ID              : 320574
                Type                : C
                Name                : python3.6
                Used GPU Memory     : 2369 MiB

# nvidia-smi -q -d compute

==============NVSMI LOG==============

Timestamp                           : Fri Apr  3 15:27:49 2020
Driver Version                      : 440.33.01
CUDA Version                        : 10.2

Attached GPUs                       : 1
GPU 00000000:3B:00.0
    Compute Mode                    : Default

[~]# nvidia-smi
Fri Apr  3 15:28:49 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   42C    P0    46W / 250W |   4750MiB / 32510MiB |     32%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    320467      C   python3.6                                   2369MiB |
|    0    320574      C   python3.6                                   2369MiB |
+-----------------------------------------------------------------------------+

From htop:

320574 ouruser 20   0 12.2G 1538M  412M R 502.  0.8 14h45:59 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320467 ouruser 20   0 12.2G 1555M  412M D 390.  0.8 14h45:13 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320654 ouruser 20   0 12.2G 1555M  412M R 111.  0.8  3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320656 ouruser 20   0 12.2G 1555M  412M R 111.  0.8  3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320658 ouruser 20   0 12.2G 1538M  412M R 111.  0.8  3h00:54 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320660 ouruser 20   0 12.2G 1538M  412M R 111.  0.8  3h00:53 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320661 ouruser 20   0 12.2G 1538M  412M R 111.  0.8  3h00:54 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320655 ouruser 20   0 12.2G 1555M  412M R 55.8  0.8  3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320657 ouruser 20   0 12.2G 1555M  412M R 55.8  0.8  3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320659 ouruser 20   0 12.2G 1538M  412M R 55.8  0.8  3h00:53 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1

Is PyTorch somehow working around Slurm and not locking a GPU because the user omitted --gres=gpu:1? And how can I tell whether MPS is actually being used?
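
One way to see whether anything is actually going through MPS on the compute node (a sketch, using the node from this job):

ssh node003 'ps -ef | grep [n]vidia-cuda-mps'
# if clients were attached to MPS on that node, an nvidia-cuda-mps-server
# process would show up alongside the control daemon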

Here’s scontrol showing the job details:

scontrol show job 2913
JobId=2913 JobName=sequentialBlur_alexnet_training_imagewoof_crossval
   UserId=wcharles(6108) GroupId=students(200) MCS_label=N/A
   Priority=4294900346 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=04:55:20 TimeLimit=365-00:00:00 TimeMin=N/A
   SubmitTime=2020-04-03T11:24:03 EligibleTime=2020-04-03T11:24:03
   AccrueTime=2020-04-03T11:24:03
   StartTime=2020-04-03T11:24:03 EndTime=2021-04-03T11:24:03 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-04-03T11:24:03
   Partition=defq AllocNode:Sid=ourcluster:166594
   ReqNodeList=node003 ExcNodeList=(null)
   NodeList=node003
   BatchHost=node003
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
   Command=/ouruser/blur/run_seq_blur3.py 0
   WorkDir=/ouruser/blur
   StdErr=/ouruser/blur/slurm-2913.out
   StdIn=/dev/null
   StdOut=/ouruser/blur/slurm-2913.out
   Power=
   MailType=BEGIN,END,FAIL,REQUEUE,STAGE_OUT

and:

scontrol show job 2914
JobId=2914 JobName=sequentialBlur_alexnet_training_imagewoof_crossval
   UserId=ouruser(6108) GroupId=students(200) MCS_label=N/A
   Priority=4294900345 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=04:55:14 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2020-04-03T11:24:11 EligibleTime=2020-04-03T11:24:11
   AccrueTime=2020-04-03T11:24:11
   StartTime=2020-04-03T11:24:12 EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-04-03T11:24:12
   Partition=defq AllocNode:Sid=ourcluster:166594
   ReqNodeList=node003 ExcNodeList=(null)
   NodeList=node003
   BatchHost=node003
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
   Command=/ouruser/blur/run_seq_blur3.py 1
   WorkDir=/ouruser/blur
   StdErr=/ouruser/blur/slurm-2914.out
   StdIn=/dev/null
   StdOut=/ouruser/blur/slurm-2914.out
   Power=
   MailUser=wcharles1@fordham.edu MailType=BEGIN,END,FAIL,REQUEUE,STAGE_OUT

First, we found out that Bright Cluster's build of Slurm does not include NVML support, so you need to compile it yourself. Bright has several customers who do not have GPUs, and the extra logging that comes with NVML could potentially confuse those users.
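
For anyone else hitting this, a minimal sketch of what the NVML-enabled rebuild and the GPU bookkeeping can look like (paths, versions, and the node line are illustrative, not copied from our cluster):

# rebuild Slurm from source against NVML so slurmd can autodetect GPUs
./configure --prefix=/cm/shared/apps/slurm/20.02 --with-nvml=/usr
make && make install

# slurm.conf: declare the GRES type and the GPU on the node
GresTypes=gpu
NodeName=node003 Gres=gpu:1 ...

# gres.conf on node003: let slurmd discover the V100 through NVML
AutoDetect=nvml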

Then there was a problem with the additional argument passed to run_seq_blur2.py: we were getting a list index error because, when you run the sbatch script, you have to append the extra parameter to the sbatch command. The user set it up that way to make it easier to run permutations of the Python script without changing the sbatch file. For example:

sbatch run_seq_blur3.py 0

where the argument can be any value from 0 to 4.
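
For clarity, sbatch passes anything after the script name straight to the batch script, so the value arrives there as $1; that is what the original last line of the sbatch file was consuming:

sbatch run_seq_blur3.py 2
# inside the batch script, $1 expands to 2:
python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof $1 | tee alex_100_imwoof_seq_longtrain_cv_$1.txt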

The final line in the sbatch file now looks like this:

python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0

Anyway, the job no longer drains the node.
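
For anyone wanting to verify that on their own cluster, a quick check is:

sinfo -R    # lists drained/down nodes together with the reason slurmd recorded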