Originally published at: Job Statistics with NVIDIA Data Center GPU Manager and SLURM | NVIDIA Technical Blog
Resource management software, such as SLURM, PBS, and Grid Engine, manages access for multiple users to shared computational resources. The basic unit of resource allocation is the “job”, a set of resources allocated to a particular user for a period of time to run a particular task. Job-level GPU usage and accounting enables both users…
The solution proposed in the post, which creates the stats group with -c allgpus, does not work for non-exclusive nodes: it accounts for every GPU in the node rather than only the GPUs allocated to the job.
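For example (GPU indices are illustrative), the prolog environment on a shared node exposes only the job's own allocation:

# Illustrative prolog environment on a shared node:
#   SLURM_JOB_GPUS=2,3    # only the GPUs allocated to this job
# The per-job DCGM group therefore has to contain exactly these indices,
# not every GPU in the node.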
You may want to build upon these instead:
prolog:
#!/bin/sh
# Create a per-job DCGM group named after the Slurm job ID
group=$(sudo -u $SLURM_JOB_USER dcgmi group -c j$SLURM_JOB_ID)
if [ $? -eq 0 ]; then
    # dcgmi group -c prints a human-readable confirmation; the numeric
    # group ID is its 10th whitespace-separated field
    groupid=$(echo $group | awk '{print $10}')
    # Add only the GPUs allocated to this job, then enable stats
    # recording and start job-level accounting
    sudo -u $SLURM_JOB_USER dcgmi group --group $groupid --add $SLURM_JOB_GPUS
    sudo -u $SLURM_JOB_USER dcgmi stats --group $groupid --enable
    sudo -u $SLURM_JOB_USER dcgmi stats --group $groupid --jstart $SLURM_JOB_ID
fi
… it would be great if dcgmi group -c supported JSON output as well, so the group ID would not have to be scraped with awk.
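Until then, a workaround (a sketch reusing the jp query from the epilog below, so jp has to be available in the prolog as well) is to resolve the group ID by name from dcgmi group -l --json instead:

# Sketch: look the freshly created group up by its name j<jobid>
# rather than parsing the creation message with awk.
groupid=$(sudo -u $SLURM_JOB_USER dcgmi group -l --json | jp "body.Groups.children.[*][0][?children.\"Group Name\".value=='j$SLURM_JOB_ID'].children.\"Group ID\".value | [0]" | sed s/\"//g)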
epilog:
#!/bin/sh
OUTPUTDIR=/tmp
# Stop job-level stats recording and write the per-job report to $OUTPUTDIR
sudo -u $SLURM_JOB_USER dcgmi stats --jstop $SLURM_JOB_ID
sudo -u $SLURM_JOB_USER dcgmi stats --verbose --job $SLURM_JOB_ID | sudo -u $SLURM_JOB_USER tee $OUTPUTDIR/dcgm-gpu-stats-$HOSTNAME-$SLURM_JOB_ID.out
# Look up the per-job group created by the prolog (named j<jobid>) and delete it
groupid=$(sudo -u $SLURM_JOB_USER dcgmi group -l --json | jp "body.Groups.children.[*][0][?children.\"Group Name\".value=='j$SLURM_JOB_ID'].children.\"Group ID\".value | [0]" | sed s/\"//g)
sudo -u $SLURM_JOB_USER dcgmi group --delete $groupid
… this requires jp, the JMESPath command-line JSON processor, to pull the group ID out of dcgmi's JSON output.
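For reference, a minimal sketch of wiring the two scripts into slurm.conf (the paths are illustrative; adjust them to your installation):

# slurm.conf (illustrative paths; the scripts must be executable on every GPU node)
Prolog=/etc/slurm/prolog.sh
Epilog=/etc/slurm/epilog.sh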