Job Statistics with NVIDIA Data Center GPU Manager and SLURM

Originally published at: Job Statistics with NVIDIA Data Center GPU Manager and SLURM | NVIDIA Technical Blog

Resource management software, such as SLURM, PBS, and Grid Engine, manages access for multiple users to shared computational resources. The basic unit of resource allocation is the “job”, a set of resources allocated to a particular user for a period of time to run a particular task. Job level GPU usage and accounting enables both users…

The proposed solution, which creates the stats group with -c allgpus, does not work on non-exclusive nodes: jobs sharing a node would have each other's GPUs counted in their statistics.

You may want to build upon these instead:

prolog:

#!/bin/sh

# Create a per-job DCGM GPU group, owned by the job's user
group=$(sudo -u "$SLURM_JOB_USER" dcgmi group -c "j$SLURM_JOB_ID")
if [ $? -eq 0 ]; then
  # dcgmi prints a message along the lines of
  #   Successfully created group "jNNN" with a group ID of N
  # so the numeric group ID is the tenth whitespace-separated field
  # ($group is deliberately left unquoted to flatten any line breaks)
  groupid=$(echo $group | awk '{print $10}')
  # Add only this job's GPUs to the group, then start per-job stats
  sudo -u "$SLURM_JOB_USER" dcgmi group --group "$groupid" --add "$SLURM_JOB_GPUS"
  sudo -u "$SLURM_JOB_USER" dcgmi stats --group "$groupid" --enable
  sudo -u "$SLURM_JOB_USER" dcgmi stats --group "$groupid" --jstart "$SLURM_JOB_ID"
fi
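For completeness, a prolog/epilog pair like this is wired in through slurm.conf; the paths below are placeholders, not a prescribed location:

```
# slurm.conf excerpt (example paths) -- slurmd runs these on each
# allocated node at job start and end
Prolog=/etc/slurm/prolog.sh
Epilog=/etc/slurm/epilog.sh
```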

… it would be great if dcgmi group -c supported JSON output as well (dcgmi group -l already does).
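Until then, a slightly sturdier way to parse the plain-text output is to take the trailing number rather than counting on it being field 10. The success-message wording below is an assumption and may differ between DCGM versions:

```shell
# Assumed shape of the `dcgmi group -c` success message:
out='Successfully created group "j12345" with a group ID of 2'

# Extract the trailing integer (the group ID) regardless of how many
# words precede it
groupid=$(printf '%s\n' "$out" | grep -oE '[0-9]+$')
echo "$groupid"
```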

epilog:

#!/bin/sh

# Where the per-job GPU stats report ends up
OUTPUTDIR=/tmp
# Stop per-job stats collection and write the report
sudo -u "$SLURM_JOB_USER" dcgmi stats --jstop "$SLURM_JOB_ID"
sudo -u "$SLURM_JOB_USER" dcgmi stats --verbose --job "$SLURM_JOB_ID" | sudo -u "$SLURM_JOB_USER" tee "$OUTPUTDIR/dcgm-gpu-stats-$HOSTNAME-$SLURM_JOB_ID.out"

# Look up the numeric ID of the group the prolog created, by its name
groupid=$(sudo -u "$SLURM_JOB_USER" dcgmi group -l --json | jp "body.Groups.children.[*][0][?children.\"Group Name\".value=='j$SLURM_JOB_ID'].children.\"Group ID\".value | [0]" | sed 's/"//g')

# Clean up the per-job group
sudo -u "$SLURM_JOB_USER" dcgmi group --delete "$groupid"

… this requires jp, the JMESPath command-line tool.
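If installing jp is not an option, the same lookup can be done in a few lines of Python. This is only a sketch: the JSON layout (body → Groups → children, each with "Group Name" and "Group ID" entries) is inferred from the jp query above, and the sample values are invented, so verify it against the actual dcgmi group -l --json output first.

```python
"""Find a DCGM group ID by group name, as an alternative to jp.

The field layout below is an assumption based on the JMESPath query
used in the epilog, not verified against a specific DCGM release.
"""
import json


def find_group_id(doc, name):
    """Return the Group ID whose Group Name equals `name`, or None."""
    for entry in doc["body"]["Groups"]["children"]:
        children = entry["children"]
        if children["Group Name"]["value"] == name:
            return children["Group ID"]["value"]
    return None


if __name__ == "__main__":
    # Demo document shaped like assumed `dcgmi group -l --json` output;
    # in the epilog the real output would be parsed with json.load().
    sample = {
        "body": {
            "Groups": {
                "children": [
                    {"children": {"Group Name": {"value": "j12345"},
                                  "Group ID": {"value": "2"}}},
                ]
            }
        }
    }
    print(find_group_id(sample, "j12345"))
```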