How do I know which GPUs / MIG instances were allocated to my job?

oren.shani · June 24, 2021, 9:42am

Hi All,

As part of benchmarking DGX performance in a SLURM cluster, I need to know which resources were actually allocated to a specific job (which GPUs / MIG instances, how much memory, etc.). What is the best way to do that, preferably from a Python script?

Many thanks,

Oren

ScottEllis · June 24, 2021, 2:12pm

Hi @oren.shani ,

Do you want to do this from within the job itself (e.g., with SLURM_JOB_GPUS, described in the Slurm Workload Manager Prolog and Epilog Guide) or from outside the job (e.g., with sacct -j $jobid)?
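For reference, both routes could be sketched in Python roughly as follows. This is a minimal sketch, not a tested recipe: the helper names (`allocated_gpu_env`, `parse_alloc_tres`, `sacct_alloc_tres`) are made up for illustration, which variables are actually set depends on your Slurm version and GRES configuration, and the `sacct` call assumes accounting is enabled on the cluster.

```python
import os
import subprocess

# Variables that Slurm / CUDA may set for a job; any of them can be absent.
GPU_ENV_NAMES = (
    "SLURM_JOB_ID",
    "SLURM_JOB_GPUS",
    "SLURM_STEP_GPUS",
    "CUDA_VISIBLE_DEVICES",
)


def allocated_gpu_env(environ=os.environ):
    """Inside the job: return the GPU-related variables that are set."""
    report = {}
    for name in GPU_ENV_NAMES:
        value = environ.get(name)
        if value is not None:
            # Device lists are comma-separated (indices or MIG UUIDs).
            report[name] = value.split(",")
    return report


def parse_alloc_tres(text):
    """Turn 'cpu=4,gres/gpu=2,mem=16G' into {'cpu': '4', ...}."""
    result = {}
    for item in text.strip().split(","):
        if "=" in item:
            key, value = item.split("=", 1)
            result[key] = value
    return result


def sacct_alloc_tres(job_id):
    """Outside the job: ask sacct for the job's allocated TRES."""
    out = subprocess.run(
        ["sacct", "-j", str(job_id), "-X", "-n", "-P", "--format=AllocTRES"],
        check=True, capture_output=True, text=True,
    ).stdout
    lines = out.splitlines()
    return parse_alloc_tres(lines[0]) if lines else {}
```

Inside a job, `print(allocated_gpu_env())` shows what the scheduler exported. From a login node, `sacct_alloc_tres(jobid)` returns the accounting view of the same job.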

oren.shani · June 25, 2021, 4:59am

Hi Scott,

I am aware of the ways to see what Slurm has allocated for the job. The problem is that TensorFlow, PyTorch, etc. do not seem to “see” the same info, especially when MIG is used. For example, tensorflow.config.list_physical_devices shows the whole GPU as available even if only a few MIG instances were allocated.

So my question is whether there are other ways to get more accurate details.

Thanks,

Oren

ScottEllis · July 6, 2021, 6:05pm

As you’re discovering, MIG means dealing with things in a… unique way. :-) We’ve been plumbing NVML and DCGM (see GitHub - NVIDIA/gpu-monitoring-tools: Tools for monitoring NVIDIA GPUs on Linux) to be MIG-aware, and right now that’s probably the best path; there are Python bindings for NVML.

I haven’t yet tried to get resources in MIG slices via NVML, but I believe it works (and can give you what you want).
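A rough, untested sketch of what that could look like with the `pynvml` bindings (the `nvidia-ml-py` package) is below. It assumes a MIG-capable driver and that the job can see the parent GPU through NVML. The helper names `list_mig_devices` and `main` are illustrative only. The module is passed in as an argument so the enumeration logic can be exercised without a GPU.

```python
def list_mig_devices(nvml):
    """Return one dict per MIG device that NVML exposes to this process.

    `nvml` is the pynvml module, already initialised with nvmlInit().
    """
    found = []
    for gpu_index in range(nvml.nvmlDeviceGetCount()):
        gpu = nvml.nvmlDeviceGetHandleByIndex(gpu_index)
        try:
            current_mode, _pending_mode = nvml.nvmlDeviceGetMigMode(gpu)
        except nvml.NVMLError:
            continue  # this GPU does not support MIG
        if current_mode != nvml.NVML_DEVICE_MIG_ENABLE:
            continue  # MIG is supported but switched off
        for slot in range(nvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
            try:
                mig = nvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, slot)
            except nvml.NVMLError:
                continue  # empty slot, or not accessible from this job
            uuid = nvml.nvmlDeviceGetUUID(mig)
            if isinstance(uuid, bytes):  # older bindings return bytes
                uuid = uuid.decode()
            memory = nvml.nvmlDeviceGetMemoryInfo(mig)
            found.append({
                "gpu_index": gpu_index,
                "slot": slot,
                "uuid": uuid,
                "memory_total_bytes": memory.total,
            })
    return found


def main():
    import pynvml  # assumption: nvidia-ml-py is installed on the node

    pynvml.nvmlInit()
    try:
        for device in list_mig_devices(pynvml):
            print(device)
    finally:
        pynvml.nvmlShutdown()
```

Calling `main()` from inside the job prints the MIG UUIDs and memory sizes NVML reports, which you can then compare against CUDA_VISIBLE_DEVICES. Whether NVML shows only the job's slices or every slice on the parent GPU depends on how the cluster constrains devices, so treat the output as something to cross-check rather than as the allocation itself.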

I don’t think TensorFlow can give you per-MIG-instance information yet. As you saw, it still bases almost everything on the parent GPU, regardless of the slice it’s pointed at.

ScottEllis · September 9, 2022, 11:28pm

Not sure what you mean @lenorecutter7 . Are you wanting to stress-test your system, or is this a MIG/scheduler question?

Help me understand what you’re trying to do.

ScottE
