As part of benchmarking DGX performance on a Slurm cluster, I need to know which resources were actually allocated to a specific job (which GPUs / MIG instances, how much memory, etc.). What is the best way to do that, preferably from a Python script?
Hi @oren.shani ,
Are you wanting to do this from within the job itself (e.g., `SLURM_JOB_GPUS`, described in the Slurm Workload Manager Prolog and Epilog Guide) or from outside the job (e.g., with `sacct -j $jobid`)?
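For the outside-the-job route, a minimal sketch of wrapping `sacct` from Python might look like the following. The `AllocTRES` field is what carries `gres/gpu` counts; treat the exact format string and field list as assumptions about your Slurm version (check `sacct --helpformat` on your cluster):

```python
import subprocess


def build_sacct_cmd(jobid):
    """Build an sacct command reporting allocated TRES (incl. GPUs) for a job.

    AllocTRES contains entries like "cpu=8,mem=64G,gres/gpu=2".
    -P gives pipe-delimited output, -n suppresses the header.
    """
    return [
        "sacct", "-j", str(jobid),
        "--format=JobID,AllocTRES,ReqMem,NNodes,NodeList",
        "-P", "-n",
    ]


def query_job_resources(jobid):
    """Run sacct and return one dict per job step.

    Returns [] if sacct is unavailable or fails (e.g. off-cluster).
    """
    try:
        out = subprocess.run(
            build_sacct_cmd(jobid),
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return []
    fields = ["JobID", "AllocTRES", "ReqMem", "NNodes", "NodeList"]
    return [
        dict(zip(fields, line.split("|")))
        for line in out.strip().splitlines() if line
    ]
```

You would then parse the `gres/gpu=N` entry out of `AllocTRES` for the GPU count; note this tells you what Slurm allocated, not what the framework inside the job sees.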
I am aware of the ways to see what Slurm has allocated for the job. The problem is that TensorFlow, PyTorch, etc. do not "see" the same information, especially when MIG is used. For example, `tensorflow.config.list_physical_devices` shows that the whole GPU is available even if only a few MIG instances were allocated.
So my question is whether there are other ways to get more accurate details.
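One thing worth checking from inside the job is the environment Slurm sets up: with MIG, `CUDA_VISIBLE_DEVICES` typically carries MIG device UUIDs (strings starting with `MIG-`) rather than plain indices, so you can at least tell which slices the job was actually given even when the framework reports the parent GPU. A sketch, assuming the usual Slurm/CUDA variable names:

```python
import os


def allocated_devices(environ=os.environ):
    """Report what Slurm/CUDA exposed to this job.

    Distinguishes MIG slice UUIDs (entries starting with "MIG-")
    from whole-GPU entries (plain indices or GPU UUIDs).
    """
    raw = environ.get("CUDA_VISIBLE_DEVICES", "")
    entries = [e.strip() for e in raw.split(",") if e.strip()]
    return {
        "mig_slices": [e for e in entries if e.startswith("MIG-")],
        "whole_gpus": [e for e in entries if not e.startswith("MIG-")],
        "slurm_job_gpus": environ.get("SLURM_JOB_GPUS"),  # may be unset
    }
```

This only reflects what the job was handed, not live utilization, but it is often enough to reconcile Slurm's view with the framework's.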
As you’re discovering, MIG means dealing with things in a…unique way. :-) We’ve been updating NVML and DCGM (see GitHub - NVIDIA/gpu-monitoring-tools: Tools for monitoring NVIDIA GPUs on Linux) to be MIG-aware, and right now that’s probably the best path; there are Python bindings for NVML.
I haven’t yet tried to get resources in MIG slices via NVML, but I believe it works (and can give you what you want).
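In case it helps, here is a rough sketch of walking MIG instances with the NVML Python bindings (`pip install nvidia-ml-py`). Treat the exact call names and behavior as dependent on your driver and bindings version; the sketch degrades to an empty list when no driver or GPU is present:

```python
try:
    import pynvml  # from the nvidia-ml-py package
except ImportError:
    pynvml = None


def list_mig_instances():
    """Return (parent_name, mig_name, mem_total_bytes) tuples for every
    MIG device visible via NVML; [] if NVML/pynvml is unavailable."""
    if pynvml is None:
        return []
    try:
        pynvml.nvmlInit()
    except Exception:  # no driver / no GPU on this host
        return []
    instances = []
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            parent = pynvml.nvmlDeviceGetHandleByIndex(i)
            try:
                current, _pending = pynvml.nvmlDeviceGetMigMode(parent)
            except pynvml.NVMLError:
                continue  # GPU does not support MIG
            if current != pynvml.NVML_DEVICE_MIG_ENABLE:
                continue
            for j in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(parent)):
                try:
                    mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(parent, j)
                except pynvml.NVMLError:
                    continue  # this MIG slot is not populated
                mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
                instances.append((pynvml.nvmlDeviceGetName(parent),
                                  pynvml.nvmlDeviceGetName(mig),
                                  mem.total))
    finally:
        pynvml.nvmlShutdown()
    return instances
```

Run inside the job, NVML should only enumerate the MIG devices the job can actually reach, which is the per-slice view the frameworks aren't giving you yet.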
I don’t think TensorFlow can yet give you per-MIG-instance information. As you saw, it still bases almost everything on the parent GPU, regardless of the slice it’s pointed at.