I’d like to share a tool that collects the resources actually consumed by a Slurm job.
For CPU and RAM, it reads cgroup accounting data.
For GPU utilization, it collects accounting data from the driver via the NVML library.
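To give an idea of what the NVML side can look like, here is a minimal sketch using the pynvml Python bindings. It is not the tool's actual code: it polls instantaneous values once per second instead of using NVML's accounting data, and the function names, the one-second interval, and the default device index are all assumptions made for illustration. The energy counter is only available on reasonably recent GPUs.

```python
import time

MJ_PER_KWH = 3.6e9  # 1 kWh = 3.6e6 J = 3.6e9 mJ


def millijoules_to_kwh(millijoules: float) -> float:
    """Convert an NVML energy reading (millijoules) to kWh."""
    return millijoules / MJ_PER_KWH


def sample_gpu(index: int = 0, seconds: int = 5) -> dict[str, float]:
    """Poll one GPU once per second (hypothetical helper, not the tool's code)."""
    if seconds < 1:
        raise ValueError("seconds must be at least 1")
    import pynvml  # imported lazily so the converter above works without a GPU

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        energy_start = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)  # mJ
        loads, peak_mem = [], 0
        for _ in range(seconds):
            # .gpu is the utilization in percent over the last sample period
            loads.append(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu / 100)
            peak_mem = max(peak_mem, pynvml.nvmlDeviceGetMemoryInfo(handle).used)
            time.sleep(1)
        energy_end = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)  # mJ
        return {
            "gpu_load": sum(loads) / len(loads),
            "gpu_peak_mem_gib": peak_mem / 2**30,
            "gpu_energy_kwh": millijoules_to_kwh(energy_end - energy_start),
        }
    finally:
        pynvml.nvmlShutdown()
```

Calling `sample_gpu()` on a node with an NVIDIA GPU and pynvml installed returns the averaged load, the peak memory seen during polling, and the energy used over the polling window.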
The tool's report looks like this:
-- sprofile report (node27) --
Time: 0:00:25 / 1:00:00
CPU load: 0.9 / 2.0
RAM peak: 3G / 8G
GPU load: 0.9 / 1.0
GPU peak mem: 3G / 32G
GPU energy: 0.0kWh
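As a rough illustration of how figures such as "CPU load" and "RAM peak" can be derived from cgroup accounting data, here is a minimal Python sketch. It is not the tool's actual code. It assumes cgroup v2, where `cpu.stat` contains a `usage_usec` line and `memory.peak` holds the peak usage in bytes (the latter needs a fairly recent kernel); the function names and the cgroup directory argument are hypothetical.

```python
from pathlib import Path


def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse the 'key value' lines of a cgroup v2 cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def average_cpu_load(usage_usec: int, elapsed_seconds: float) -> float:
    """Average number of busy cores: CPU time consumed / wall-clock time."""
    return usage_usec / 1e6 / elapsed_seconds


def read_job_usage(cgroup_dir: str, elapsed_seconds: float) -> dict[str, float]:
    """Read CPU load and peak RAM from a job's cgroup directory (assumed v2 layout)."""
    base = Path(cgroup_dir)
    cpu = parse_cpu_stat((base / "cpu.stat").read_text())
    peak_bytes = int((base / "memory.peak").read_text())
    return {
        "cpu_load": average_cpu_load(cpu["usage_usec"], elapsed_seconds),
        "ram_peak_gib": peak_bytes / 2**30,
    }
```

For example, a job that has accumulated 22.5 s of CPU time after 25 s of wall-clock time has an average load of 0.9 cores, which is then compared against the number of cores reserved.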
The tool is similar to NVIDIA DCGM but better suited to collecting data at job-level granularity. It can also be installed by a regular user, without administrator rights, which is more convenient.
I hope it can help cluster users adjust their resource reservations better.