Hi
I’ve configured a Kubernetes cluster on a MIG-enabled GPU node and partitioned the GPU into 4 MIG instances with a mix of slice configurations (2g.20gb, 1g.10gb, 3g.40gb, 1g.10gb).
System Details:
Driver Version: 555.42.06
CUDA Version: 12.5
GPU: NVIDIA H100 80GB HBM3
| MIG Device | Slice Type | SMs | Memory |
|---|---|---|---|
| GPU-I 3 | 3g.40gb | 60 | 40 GB |
| GPU-I 2 | 2g.20gb | 32 | 20 GB |
| GPU-I 0 | 1g.10gb | 16 | 10 GB |
| GPU-I 1 | 1g.10gb | 16 | 10 GB |
I used the following DCGM command to collect GPU utilization metrics (field 1001 = DCGM_FI_PROF_GR_ENGINE_ACTIVE, shown as GRACT; field 1004 = DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, shown as TENSO):
```
dcgmi dmon -e 1001,1004 -g 2
```
This is the output (slice-type annotations added by us):
```
Entity     GRACT   TENSO
GPU 0      0.422   0.876
GPU-I 3    0.984   0.816   # 3g.40gb
GPU-I 2    0.989   0.913   # 2g.20gb
GPU-I 0    0.995   0.965   # 1g.10gb
GPU-I 1    0.995   0.964   # 1g.10gb
```
According to the DCGM documentation (Feature Overview, NVIDIA DCGM Documentation), applying the aggregation formula described there to our measured per-instance values yields a GPU 0 GRACT that does not match the value DCGM reports (0.422). The discrepancy appears when the MIG instances have different slice types; the examples in the documentation all use identical slices.
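To make the mismatch concrete, here is the value we expected. This is a minimal sketch under our own assumptions: that the GPU-level GRACT is the SM-weighted average of the per-instance values, and that the H100 SXM5 exposes 132 SMs in total.

```python
# Expected GPU-level GRACT under our assumption of SM-weighted averaging.
sms   = [60, 32, 16, 16]               # 3g.40gb, 2g.20gb, 1g.10gb, 1g.10gb
gract = [0.984, 0.989, 0.995, 0.995]   # per-instance GRACT from dcgmi dmon
total_sms = 132                        # assumed H100 SXM5 SM count
expected = sum(s * g for s, g in zip(sms, gract)) / total_sms
print(round(expected, 3))              # -> 0.928
```

Even if we weight by the 124 SMs actually allocated to the instances instead of 132, the result is about 0.988; neither comes close to the reported 0.422.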
Questions:
- What is the correct method to calculate the overall GPU utilization (GRACT, TENSO) when MIG instances are heterogeneous (e.g., 3g, 2g, 1g)? Please provide a formula or a working example for mixed-slice setups.
- Why does the GPU 0 GRACT not match the sum of weighted instance utilizations in our case? Are there any internal weights or normalization factors beyond SM count?
- Is dcgmi dmon the only officially supported way to monitor per-slice and full-GPU utilization? Do other tools such as the DCGM APIs, NVML, or NVIDIA Nsight offer more accurate or detailed slice-level telemetry? (A sketch of the NVML cross-check we have in mind follows this list.)
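For context on the last question, this is a minimal sketch of the NVML cross-check we have in mind, not an officially documented method: it enumerates the MIG devices of GPU 0 via the nvidia-ml-py (pynvml) bindings and prints their SM counts and memory sizes, so that any SM-fraction weighting can be verified against what DCGM reports.

```python
# Minimal sketch (our own cross-check, not an official DCGM workflow):
# list the MIG devices on GPU 0 and print their SM and memory attributes.
import pynvml

pynvml.nvmlInit()
try:
    gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
    for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
        try:
            mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, i)
        except pynvml.NVMLError:
            continue  # unused MIG slot
        attrs = pynvml.nvmlDeviceGetAttributes(mig)
        gi_id = pynvml.nvmlDeviceGetGpuInstanceId(mig)
        print(f"GPU-I {gi_id}: {attrs.multiprocessorCount} SMs, "
              f"{attrs.memorySizeMB} MB")
finally:
    pynvml.nvmlShutdown()
```

The profiling metrics themselves (GRACT, TENSO) would still come from DCGM; we only use NVML here to confirm the per-instance SM counts that any weighting formula would rely on.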
We would greatly appreciate any clarification or documentation references on these topics; it would help us monitor GPU workloads accurately in production Kubernetes environments using MIG.