Originally published at: Job Statistics with NVIDIA Data Center GPU Manager and SLURM | NVIDIA Technical Blog
Resource management software, such as SLURM, PBS, and Grid Engine, manages access for multiple users to shared computational resources. The basic unit of resource allocation is the “job”, a set of resources allocated to a particular user for a period of time to run a particular task. Job-level GPU usage and accounting enables both users…
The solution proposed in the post, which creates the stats group with -c allgpus, does not work for non-exclusive nodes: it accounts for every GPU in the node rather than only the GPUs allocated to the job.
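For example (GPU indices are illustrative), the prolog environment on a shared node exposes only the job's own allocation:

# Illustrative prolog environment on a shared node:
#   SLURM_JOB_GPUS=2,3    # only the GPUs allocated to this job
# The per-job DCGM group therefore has to contain exactly these indices,
# not every GPU in the node.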
You may want to build upon these instead:
prolog:
#!/bin/sh
# Create a per-job DCGM group named after the Slurm job ID
group=$(sudo -u $SLURM_JOB_USER dcgmi group -c j$SLURM_JOB_ID)
if [ $? -eq 0 ]; then
    # dcgmi group -c prints a human-readable confirmation; the numeric
    # group ID is its 10th whitespace-separated field
    groupid=$(echo $group | awk '{print $10}')
    # Add only the GPUs allocated to this job, then enable stats
    # recording and start job-level accounting
    sudo -u $SLURM_JOB_USER dcgmi group --group $groupid --add $SLURM_JOB_GPUS
    sudo -u $SLURM_JOB_USER dcgmi stats --group $groupid --enable
    sudo -u $SLURM_JOB_USER dcgmi stats --group $groupid --jstart $SLURM_JOB_ID
fi
… it would be great if dcgmi group -c supported JSON output as well, so the group ID would not have to be scraped with awk.
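Until then, a workaround (a sketch reusing the jp query from the epilog below, so jp has to be available in the prolog as well) is to resolve the group ID by name from dcgmi group -l --json instead:

# Sketch: look the freshly created group up by its name j<jobid>
# rather than parsing the creation message with awk.
groupid=$(sudo -u $SLURM_JOB_USER dcgmi group -l --json | jp "body.Groups.children.[*][0][?children.\"Group Name\".value=='j$SLURM_JOB_ID'].children.\"Group ID\".value | [0]" | sed s/\"//g)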
epilog:
#!/bin/sh
OUTPUTDIR=/tmp
# Stop job-level stats recording and write the per-job report to $OUTPUTDIR
sudo -u $SLURM_JOB_USER dcgmi stats --jstop $SLURM_JOB_ID
sudo -u $SLURM_JOB_USER dcgmi stats --verbose --job $SLURM_JOB_ID | sudo -u $SLURM_JOB_USER tee $OUTPUTDIR/dcgm-gpu-stats-$HOSTNAME-$SLURM_JOB_ID.out
# Look up the per-job group created by the prolog (named j<jobid>) and delete it
groupid=$(sudo -u $SLURM_JOB_USER dcgmi group -l --json | jp "body.Groups.children.[*][0][?children.\"Group Name\".value=='j$SLURM_JOB_ID'].children.\"Group ID\".value | [0]" | sed s/\"//g)
sudo -u $SLURM_JOB_USER dcgmi group --delete $groupid
… this requires jp, the JMESPath command-line JSON processor, to pull the group ID out of dcgmi's JSON output.
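For reference, a minimal sketch of wiring the two scripts into slurm.conf (the paths are illustrative; adjust them to your installation):

# slurm.conf (illustrative paths; the scripts must be executable on every GPU node)
Prolog=/etc/slurm/prolog.sh
Epilog=/etc/slurm/epilog.sh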