Running Slurm 20.02 on CentOS 7.7 with Bright Cluster 8.2. I'm wondering how the sbatch file below ends up sharing a GPU: two jobs submitted from it are running on the same GPU on node003.
MPS is running on the head node:
ps -auwx|grep mps
root 108581 0.0 0.0 12780 812 ? Ssl Mar23 0:27 /cm/local/apps/cuda-driver/libs/440.33.01/bin/nvidia-cuda-mps-control -d
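(As far as I understand, the MPS control daemon has to run on the node that owns the GPU, not just the head node, so I assume the equivalent check would need to pass on node003 as well; I haven't pasted that output here, but it would be something like:)
ssh node003 "ps -auwx | grep [m]ps-control"    # the [m] keeps grep from matching itself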
The entire script is posted on SO here.
Here are the sbatch file contents:
#!/bin/sh
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --job-name=sequentialBlur_alexnet_training_imagewoof_crossval
#SBATCH --nodelist=node003
module purge
module load gcc5 cuda10.1
module load openmpi/cuda/64
module load pytorch-py36-cuda10.1-gcc
module load ml-pythondeps-py36-cuda10.1-gcc
python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof $1 | tee alex_100_imwoof_seq_longtrain_cv_$1.txt
From nvidia-smi on the compute node:
Processes
Process ID : 320467
Type : C
Name : python3.6
Used GPU Memory : 2369 MiB
Process ID : 320574
Type : C
Name : python3.6
Used GPU Memory : 2369 MiB
# nvidia-smi -q -d compute
==============NVSMI LOG==============
Timestamp : Fri Apr 3 15:27:49 2020
Driver Version : 440.33.01
CUDA Version : 10.2
Attached GPUs : 1
GPU 00000000:3B:00.0
Compute Mode : Default
[~]# nvidia-smi
Fri Apr 3 15:28:49 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:3B:00.0 Off | 0 |
| N/A 42C P0 46W / 250W | 4750MiB / 32510MiB | 32% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 320467 C python3.6 2369MiB |
| 0 320574 C python3.6 2369MiB |
+-----------------------------------------------------------------------------+
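For what it's worth, I expected that if MPS were actually brokering these processes on this V100 I would see an nvidia-cuda-mps-server process in that table and the python3.6 clients listed with type M+C rather than plain C, but I may be misreading the MPS docs.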
From htop:
320574 ouruser 20 0 12.2G 1538M 412M R 502. 0.8 14h45:59 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320467 ouruser 20 0 12.2G 1555M 412M D 390. 0.8 14h45:13 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320654 ouruser 20 0 12.2G 1555M 412M R 111. 0.8 3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320656 ouruser 20 0 12.2G 1555M 412M R 111. 0.8 3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320658 ouruser 20 0 12.2G 1538M 412M R 111. 0.8 3h00:54 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320660 ouruser 20 0 12.2G 1538M 412M R 111. 0.8 3h00:53 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320661 ouruser 20 0 12.2G 1538M 412M R 111. 0.8 3h00:54 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320655 ouruser 20 0 12.2G 1555M 412M R 55.8 0.8 3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320657 ouruser 20 0 12.2G 1555M 412M R 55.8 0.8 3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320659 ouruser 20 0 12.2G 1538M 412M R 55.8 0.8 3h00:53 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
Is PyTorch somehow working around Slurm and NOT locking a GPU since the user omitted --gres=gpu:1? How can I tell if MPS is really working?
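For Slurm to actually track and lock the GPU, I assume the script header would need a GRES request along these lines (untested sketch for this cluster; it assumes GPUs are defined as a gres on node003):
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --gres=gpu:1                 # ask Slurm to allocate (and lock) one GPU
#SBATCH --nodelist=node003
And to see whether MPS is actually serving clients, I believe the control daemon can be queried directly (assuming the stock nvidia-cuda-mps-control interface):
echo get_server_list | /cm/local/apps/cuda-driver/libs/440.33.01/bin/nvidia-cuda-mps-control
# prints the PID of each running MPS server; an empty list would suggest no clients have attached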
Here’s scontrol showing the job details:
scontrol show job 2913
JobId=2913 JobName=sequentialBlur_alexnet_training_imagewoof_crossval
UserId=wcharles(6108) GroupId=students(200) MCS_label=N/A
Priority=4294900346 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=04:55:20 TimeLimit=365-00:00:00 TimeMin=N/A
SubmitTime=2020-04-03T11:24:03 EligibleTime=2020-04-03T11:24:03
AccrueTime=2020-04-03T11:24:03
StartTime=2020-04-03T11:24:03 EndTime=2021-04-03T11:24:03 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-04-03T11:24:03
Partition=defq AllocNode:Sid=ourcluster:166594
ReqNodeList=node003 ExcNodeList=(null)
NodeList=node003
BatchHost=node003
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
Command=/ouruser/blur/run_seq_blur3.py 0
WorkDir=/ouruser/blur
StdErr=/ouruser/blur/slurm-2913.out
StdIn=/dev/null
StdOut=/ouruser/blur/slurm-2913.out
Power=
MailType=BEGIN,END,FAIL,REQUEUE,STAGE_OUT
and:
scontrol show job 2914
JobId=2914 JobName=sequentialBlur_alexnet_training_imagewoof_crossval
UserId=ouruser(6108) GroupId=students(200) MCS_label=N/A
Priority=4294900345 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=04:55:14 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2020-04-03T11:24:11 EligibleTime=2020-04-03T11:24:11
AccrueTime=2020-04-03T11:24:11
StartTime=2020-04-03T11:24:12 EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-04-03T11:24:12
Partition=defq AllocNode:Sid=ourcluster:166594
ReqNodeList=node003 ExcNodeList=(null)
NodeList=node003
BatchHost=node003
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
Command=/ouruser/blur/run_seq_blur3.py 1
WorkDir=/ouruser/blur
StdErr=/ouruser/blur/slurm-2914.out
StdIn=/dev/null
StdOut=/ouruser/blur/slurm-2914.out
Power=
MailUser=wcharles1@fordham.edu MailType=BEGIN,END,FAIL,REQUEUE,STAGE_OUT
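Since neither TRES line shows a gres/gpu entry, I'm also not sure the GPU is even declared as a GRES in Slurm. My plan is to verify that with something like the following (standard Slurm commands, nothing Bright-specific):
scontrol show node node003 | grep -i -e gres -e tres
# expect Gres=gpu:1 (and gres/gpu in CfgTRES) if slurm.conf/gres.conf declare the GPU
sacct -j 2913,2914 --format=JobID,AllocTRES%40
# shows which trackable resources Slurm actually allocated to each job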