Nvidia-smi GPU metrics show "combined" results when there are multiple vGPUs running on the same card

Hi, I am trying to gather GPU and memory metrics from each individual vGPU. This is my current setup:
| NVIDIA-SMI 525.105.14            Driver Version: 525.105.14          |
| GPU  Name                        | Bus-Id                 | GPU-Util  |
|      vGPU ID      Name           | VM ID     VM Name      | vGPU-Util |
|   0  NVIDIA A16                  | 00000000:47:00.0       |   96%     |
|      3252070918   NVIDIA A16-4A  | 9107803   vc14wingpu01 |   95%     |
|      3252071060   NVIDIA A16-4A  | 9108788   vc14wingpu02 |    0%     |

I am running a couple of VMs in vCenter 8.0U1 on an ESXi host configured to run the GPU in Shared Direct mode (vendor shared passthrough graphics). Both VMs run Windows 10, and a constant GPU stress test pushes each VM's GPU utilization to ~50%.

Issue: I am using the command nvidia-smi vgpu -q -i 00000000:47:00.0 -u to gather the instantaneous utilization of each vGPU and comparing it with the metrics reported by the Windows operating system (Task Manager, Performance tab). nvidia-smi reports the following:

GPU   vGPU          sm    mem   enc   dec
Idx   Id            %     %     %     %
  0   3252070918    94     0     0     0
  0   3252071060     0     0     0     0
  0   3252070918    94     0     0     0
  0   3252071060     0     0     0     0

In other words, one of the vGPUs always reports 0% utilization, while the other vGPU reports what looks like the "combined" utilization of both VMs together (each VM's OS reports ~50%).
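
For reference, this is roughly how I am collecting the numbers above. It is only a minimal Python sketch: the column parsing assumes the layout shown above, it assumes each invocation of the command prints one snapshot and then exits, and the bus ID is the one from my setup.

```python
import subprocess
import time
from collections import defaultdict

BUS_ID = "00000000:47:00.0"  # bus ID of the A16 in my setup


def sample_vgpu_sm_util():
    """Run nvidia-smi once and return {vgpu_id: sm utilization %}."""
    out = subprocess.run(
        ["nvidia-smi", "vgpu", "-q", "-i", BUS_ID, "-u"],
        capture_output=True, text=True, check=True,
    ).stdout
    utils = {}
    for line in out.splitlines():
        parts = line.split()
        # Expect data rows like "0  3252070918  94  0  0  0"
        # (GPU idx, vGPU id, sm%, mem%, enc%, dec%); skip header rows.
        if len(parts) == 6 and parts[1].isdigit() and parts[2].isdigit():
            utils[parts[1]] = int(parts[2])
    return utils


# Average a few one-second samples to smooth out instantaneous readings.
history = defaultdict(list)
for _ in range(10):
    for vgpu_id, sm in sample_vgpu_sm_util().items():
        history[vgpu_id].append(sm)
    time.sleep(1)

for vgpu_id, samples in history.items():
    print(f"vGPU {vgpu_id}: avg sm util {sum(samples) / len(samples):.1f}%")
```

Even after averaging over a window like this, the split stays the same: one vGPU near 94-96% and the other at 0%.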


  1. Why does the utilization reported by the OS differ from what nvidia-smi reports when there are multiple vGPUs running on the same card?
  2. Is there a way, using nvidia-smi (or NVML, e.g. something along the lines of the sketch below), to gather metrics that closely match what the OS reports? In this case the metrics reported by the OS appear to be more accurate.
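
To make question 2 concrete, below is a minimal sketch of sampling per-vGPU utilization through NVML instead of the CLI. It assumes the nvidia-ml-py (pynvml) package on the host exposes nvmlDeviceGetVgpuUtilization for this driver, and the sample-field unpacking (uiVal) is my guess from the NVML headers, so it may need adjusting.

```python
import time
import pynvml  # nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # the A16 in my setup

last_seen = 0
for _ in range(10):
    try:
        # One utilization sample per vGPU instance, newer than last_seen.
        samples = pynvml.nvmlDeviceGetVgpuUtilization(handle, last_seen)
    except pynvml.NVMLError:
        samples = []  # e.g. no new samples yet, or unsupported on this driver
    for s in samples:
        # smUtil/memUtil are nvmlValue_t unions; uiVal assumes unsigned-int samples.
        print(f"vGPU instance {s.vgpuInstance}: "
              f"sm {s.smUtil.uiVal}% mem {s.memUtil.uiVal}%")
        last_seen = max(last_seen, s.timeStamp)
    time.sleep(1)

pynvml.nvmlShutdown()
```

I assume nvidia-smi reads the same counters through NVML, so I would expect the same skewed numbers, but this at least keeps the per-vGPU-instance IDs explicit.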

Hello Prasad,

Thank you for opening a thread on the NVIDIA developer forum.

If you have an active support contract, we invite you to open a case with the NVIDIA Enterprise Support team, who will be best able to assist you with this issue. A case can be opened through the support portal here:

Thank you,