PyTorch with Slurm and MPS work-around --gres=gpu:1

We're running Slurm 20.02 on CentOS 7.7 with Bright Cluster 8.2. I'm trying to understand how the sbatch file below ends up sharing a GPU.

MPS is running on the head node:

ps -auwx|grep mps
root     108581  0.0  0.0  12780   812 ?        Ssl  Mar23   0:27 /cm/local/apps/cuda-driver/libs/440.33.01/bin/nvidia-cuda-mps-control -d
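
For what it's worth, the control daemon can also be queried directly; a quick sanity check (not part of the original debugging session) is to ask it for its server list, which prints the PIDs of any active MPS servers:

echo get_server_list | nvidia-cuda-mps-control    # empty output means no clients have attached yet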

The entire script is posted on Stack Overflow.

Here are the sbatch file contents:

#!/bin/sh
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --job-name=sequentialBlur_alexnet_training_imagewoof_crossval
#SBATCH --nodelist=node003
module purge
module load gcc5 cuda10.1
module load openmpi/cuda/64
module load pytorch-py36-cuda10.1-gcc
module load ml-pythondeps-py36-cuda10.1-gcc
python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof $1 | tee alex_100_imwoof_seq_longtrain_cv_$1.txt
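
For reference, requesting the GPU explicitly would just mean adding one more directive to that header block; a minimal sketch, assuming GresTypes=gpu is set in slurm.conf and node003 has a matching gres.conf entry:

#SBATCH --gres=gpu:1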

From nvidia-smi on the compute node:

        Processes
            Process ID              : 320467
                Type                : C
                Name                : python3.6
                Used GPU Memory     : 2369 MiB
            Process ID              : 320574
                Type                : C
                Name                : python3.6
                Used GPU Memory     : 2369 MiB

# nvidia-smi -q -d compute

==============NVSMI LOG==============

Timestamp                           : Fri Apr  3 15:27:49 2020
Driver Version                      : 440.33.01
CUDA Version                        : 10.2

Attached GPUs                       : 1
GPU 00000000:3B:00.0
    Compute Mode                    : Default

[~]# nvidia-smi
Fri Apr  3 15:28:49 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   42C    P0    46W / 250W |   4750MiB / 32510MiB |     32%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    320467      C   python3.6                                   2369MiB |
|    0    320574      C   python3.6                                   2369MiB |
+-----------------------------------------------------------------------------+

From htop:

320574 ouruser 20   0 12.2G 1538M  412M R 502.  0.8 14h45:59 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320467 ouruser 20   0 12.2G 1555M  412M D 390.  0.8 14h45:13 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320654 ouruser 20   0 12.2G 1555M  412M R 111.  0.8  3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320656 ouruser 20   0 12.2G 1555M  412M R 111.  0.8  3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320658 ouruser 20   0 12.2G 1538M  412M R 111.  0.8  3h00:54 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320660 ouruser 20   0 12.2G 1538M  412M R 111.  0.8  3h00:53 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320661 ouruser 20   0 12.2G 1538M  412M R 111.  0.8  3h00:54 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320655 ouruser 20   0 12.2G 1555M  412M R 55.8  0.8  3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320657 ouruser 20   0 12.2G 1555M  412M R 55.8  0.8  3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320659 ouruser 20   0 12.2G 1538M  412M R 55.8  0.8  3h00:53 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1

Is PyTorch somehow working around Slurm and not locking a GPU because the user omitted --gres=gpu:1? And how can I tell whether MPS is actually being used?
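
One way to see whether anything is actually going through MPS on the compute node (a sketch, using the node from this job):

ssh node003 'ps -ef | grep [n]vidia-cuda-mps'
# if clients were attached to MPS on that node, an nvidia-cuda-mps-server
# process would show up alongside the control daemon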

Here’s scontrol showing the job details:

scontrol show job 2913
JobId=2913 JobName=sequentialBlur_alexnet_training_imagewoof_crossval
   UserId=wcharles(6108) GroupId=students(200) MCS_label=N/A
   Priority=4294900346 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=04:55:20 TimeLimit=365-00:00:00 TimeMin=N/A
   SubmitTime=2020-04-03T11:24:03 EligibleTime=2020-04-03T11:24:03
   AccrueTime=2020-04-03T11:24:03
   StartTime=2020-04-03T11:24:03 EndTime=2021-04-03T11:24:03 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-04-03T11:24:03
   Partition=defq AllocNode:Sid=ourcluster:166594
   ReqNodeList=node003 ExcNodeList=(null)
   NodeList=node003
   BatchHost=node003
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
   Command=/ouruser/blur/run_seq_blur3.py 0
   WorkDir=/ouruser/blur
   StdErr=/ouruser/blur/slurm-2913.out
   StdIn=/dev/null
   StdOut=/ouruser/blur/slurm-2913.out
   Power=
   MailType=BEGIN,END,FAIL,REQUEUE,STAGE_OUT

and:

scontrol show job 2914
JobId=2914 JobName=sequentialBlur_alexnet_training_imagewoof_crossval
   UserId=ouruser(6108) GroupId=students(200) MCS_label=N/A
   Priority=4294900345 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=04:55:14 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2020-04-03T11:24:11 EligibleTime=2020-04-03T11:24:11
   AccrueTime=2020-04-03T11:24:11
   StartTime=2020-04-03T11:24:12 EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-04-03T11:24:12
   Partition=defq AllocNode:Sid=ourcluster:166594
   ReqNodeList=node003 ExcNodeList=(null)
   NodeList=node003
   BatchHost=node003
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
   Command=/ouruser/blur/run_seq_blur3.py 1
   WorkDir=/ouruser/blur
   StdErr=/ouruser/blur/slurm-2914.out
   StdIn=/dev/null
   StdOut=/ouruser/blur/slurm-2914.out
   Power=
   MailUser=wcharles1@fordham.edu MailType=BEGIN,END,FAIL,REQUEUE,STAGE_OUT

First, we found out that Bright Cluster's build of Slurm does not include NVML support, so you need to compile it yourself. Bright has several customers who do not have GPUs, and the extra logging that comes with NVML could potentially confuse those users.
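
For anyone else hitting this, a minimal sketch of what the NVML-enabled rebuild and the GPU bookkeeping can look like (paths, versions, and the node line are illustrative, not copied from our cluster):

# rebuild Slurm from source against NVML so slurmd can autodetect GPUs
./configure --prefix=/cm/shared/apps/slurm/20.02 --with-nvml=/usr
make && make install

# slurm.conf: declare the GRES type and the GPU on the node
GresTypes=gpu
NodeName=node003 Gres=gpu:1 ...

# gres.conf on node003: let slurmd discover the V100 through NVML
AutoDetect=nvml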

Then there was a problem with the additional argument passed to run_seq_blur2.py: we were getting a list index error because, when you run the sbatch script, you have to append the extra parameter to the sbatch command. The user set it up that way to make it easier to run permutations of the Python script without changing the sbatch file. For example:

sbatch run_seq_blur3.py 0

where the argument can be any value from 0 to 4.
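
For clarity, sbatch passes anything after the script name straight to the batch script, so the value arrives there as $1; that is what the original last line of the sbatch file was consuming:

sbatch run_seq_blur3.py 2
# inside the batch script, $1 expands to 2:
python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof $1 | tee alex_100_imwoof_seq_longtrain_cv_$1.txt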

The final line in the sbatch file now looks like this:

python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0

Anyway, the job no longer drains the node.
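
For anyone wanting to verify that on their own cluster, a quick check is:

sinfo -R    # lists drained/down nodes together with the reason slurmd recorded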