nvidia-smi shows the last K80 GPU (out of 8) as always busy

Dear CUDA experts,

We have 14 GPU nodes, each with 8 GPUs (K80). I’ve noticed that on empty GPU nodes “nvidia-smi” always shows non-zero GPU utilization for the last GPU device (i.e. number 7, counting from 0 to 7). The number is usually >50% and keeps changing; however, the same output also says “No running processes found”. What does that mean, is it normal, and what is that GPU doing? Any ideas? Thank you in advance!

$ nvidia-smi
Fri Dec 15 16:30:22 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.69                 Driver Version: 384.69                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:04:00.0 Off |                    0 |
| N/A   38C    P0    57W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:05:00.0 Off |                    0 |
| N/A   31C    P0    73W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:84:00.0 Off |                    0 |
| N/A   39C    P0    61W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:85:00.0 Off |                    0 |
| N/A   31C    P0    72W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   29C    P0    59W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 00000000:8B:00.0 Off |                    0 |
| N/A   38C    P0    76W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           Off  | 00000000:8E:00.0 Off |                    0 |
| N/A   30C    P0    59W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           Off  | 00000000:8F:00.0 Off |                    0 |
| N/A   39C    P0    73W / 149W |      0MiB / 11439MiB |     64%      Default |  <--
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |  <--
+-----------------------------------------------------------------------------+

And this is the steady state with the machine idling? I am asking because all GPUs are in power state P0 = “full power” (with elevated power draw because of that), as if this snapshot had been taken at the end of a CUDA-accelerated app, after GPU activity ceased but before the app terminated (and the power state dropped to a power-saving state like P8).

BTW, it is interesting to see that the odd-numbered GPU in each pair has higher power consumption and temperature. Weird.
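For reference, a quick way to see the power state, power draw, temperature, and utilization of every GPU on one line each is nvidia-smi’s CSV query mode (the field names below should be available in a 384.xx driver, but nvidia-smi --help-query-gpu lists exactly what your version supports):

$ nvidia-smi --query-gpu=index,pstate,power.draw,temperature.gpu,utilization.gpu --format=csv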

With respect to the nvidia-smi reporting utilization percentage on one of the GPUs, this is normal behavior. The act of running nvidia-smi generates momentary utilization on one of the GPUs, typically.
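An easy way to watch this is nvidia-smi’s scrolling device monitor (assuming dmon is available in your driver version), which prints one line per GPU per second; the “sm” column is the utilization, and the momentary non-zero reading should hop between GPUs or disappear entirely between samples:

$ nvidia-smi dmon -s pu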

Hi, Njuffa and Txbob:

Yes, this is a steady state with the node idling, i.e. absolutely NO user applications running. As for “the odd-numbered GPU in each pair has higher power consumption and temperature”, this is probably related to the hardware configuration:

$ nvidia-smi topo -m
GPU      0     1     2     3     4     5     6     7     mlx4_0   CPU Affinity
0        X     PIX   SOC   SOC   SOC   SOC   SOC   SOC   SOC      0-5
1        PIX   X     SOC   SOC   SOC   SOC   SOC   SOC   SOC      0-5
2        SOC   SOC   X     PIX   PHB   PHB   PHB   PHB   PHB      6-11
3        SOC   SOC   PIX   X     PHB   PHB   PHB   PHB   PHB      6-11
4        SOC   SOC   PHB   PHB   X     PIX   PXB   PXB   PHB      6-11
5        SOC   SOC   PHB   PHB   PIX   X     PXB   PXB   PHB      6-11
6        SOC   SOC   PHB   PHB   PXB   PXB   X     PIX   PHB      6-11
7        SOC   SOC   PHB   PHB   PXB   PXB   PIX   X     PHB      6-11
mlx4_0   SOC   SOC   PHB   PHB   PHB   PHB   PHB   PHB   X

Legend:

X = Self
SOC = Connection traversing PCIe as well as the SMP link between CPU sockets (e.g., QPI)
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks

Txbob,

Your answer reminded me of Heisenberg’s uncertainty principle: “every measurement necessarily has to disturb the quantum particle, which distorts the results of any further measurements” (:

That does make sense, though. However, we have also discovered that enabling “Persistence Mode” on all GPU devices of the node gets rid of the non-zero GPU utilization on the last GPU device; in other words, the last GPU device shows 0% utilization in that case. My question is: should we turn on “Persistence Mode” on all GPU nodes now, and would it make any difference with regard to GPU device functionality/performance (isn’t this mode deprecated, http://docs.nvidia.com/deploy/driver-persistence/)? The problem is that I DO need all GPUs to show 0% utilization for my script, which greps the utilization from nvidia-smi and then reserves the free GPU devices.
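In case it helps, the idea of the script is roughly the following (a simplified sketch, not the actual script; it uses nvidia-smi’s CSV query mode instead of grepping the human-readable table, and treats any GPU reporting 0% utilization as free):

$ nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits \
    | awk -F', ' '$2 == 0 {print $1}'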

Thank you so much for your answers, much appreciated!

That’s correct, persistence mode will eliminate this effect. Persistence mode keeps the GPU(s) in a fully active state whether they are being utilized or not. In this fully active state, the process of querying the GPU by nvidia-smi does not generate “Utilization”.
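For example, persistence mode can be turned on with nvidia-smi itself (requires root; -i restricts the setting to a single GPU, omit it to apply to all GPUs on the node):

$ sudo nvidia-smi -pm 1
$ sudo nvidia-smi -i 7 -pm 1

Regarding the deprecation question: per the driver-persistence document linked above, it is the legacy -pm setting that is being phased out in favor of the nvidia-persistenced daemon, which provides equivalent behavior.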

I personally would not build a GPU allocator that relies on this utilization percentage as an inferential measure of whether or not the GPU is in use. Rather, I would use a non-inferential method such as a job scheduler. If you require this particular methodology, however, then enabling persistence mode should help. There shouldn’t be any problems with enabling persistence mode, but average GPU power draw in the idle state will be higher.

Thank you for the quick response, Txbob. The problem with the Univa Grid Engine scheduler is that it does not discriminate between GPU devices, i.e., it won’t tell you which GPU devices are free or busy (which is required for running MD packages such as ACEMD and NAMD). Also, the scheduler can easily be confused when users declare 1 GPU but then use 8, for example, which happens quite often. Thank you so much for your feedback; I will let our technical lead know that we can turn on “persistence mode” for all GPU devices/nodes.

There are plenty of GPU-aware job schedulers/resource managers.

Univa claims to be one of these:

http://www.univa.com/resources/files/gpus.pdf

Perhaps you are not using it correctly.

Thank you for sending the link, Txbob. This is what we need to figure out!

That seems wrong. As best I know, other task scheduling facilities such as LSF control the available GPUs with CUDA_VISIBLE_DEVICES, so a scheduled task cannot grab more GPUs than it reserved or is entitled to under the queue restrictions.
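For illustration (the binary name and device indices here are hypothetical): if the scheduler exports only GPUs 2 and 3 to a job, the application sees exactly two devices, which it enumerates as 0 and 1, and it cannot touch the other six:

$ export CUDA_VISIBLE_DEVICES=2,3
$ ./my_cuda_app    # hypothetical app; cudaGetDeviceCount() reports 2 in this environment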