dcgmUpdateAllFields returns "Timeout"

dmitrygx · December 22, 2022, 2:10pm

Hi NVIDIA experts,

I use the Go DCGM API to watch fields and then update their values through the Fields API.
nv-hostengine is launched as a standalone process in my setup, and the DCGM client connects to it over TCP using the default port and “socketAddress == localhost”. Both the DCGM client and nv-hostengine run on the same host.

Everything works fine, but after 3 months days every invocation of dcgm.UpdateAllFields() starts to fail with a “Timeout” error.
Restarting the DCGM client along with nv-hostengine resolves the issue.

The following versions of SW are used:

datacenter-gpu-manager/unstable,unstable,now 1:2.3.6 amd64
libnvidia-cfg1-450/unstable,now 450.80.02-0ubuntu1 amd64
libnvidia-compute-450/unstable,now 450.80.02-0ubuntu1 amd64
libnvidia-nscq-450/unstable,unstable,now 450.80.02-1 amd64
cat /sys/module/nvidia/version - 450.80.02

I have the following questions:

Should we consider using a UNIX domain socket instead of a TCP connection between the DCGM client and nv-hostengine?
Could you explain why DCGM_ST_TIMEOUT could be returned from dcgmUpdateAllFields? Can it be related to broken TCP connectivity? If so, maybe you should consider reconnecting to nv-hostengine; otherwise the client needs to close and recreate the dcgmHandle_t handle.
By the way, DCGM_ST_TIMEOUT isn’t a documented return value of dcgmUpdateAllFields. Should it be documented?

ScottEllis · December 27, 2022, 10:13pm

Hi @dmitrygx ,

This doesn’t look DGX-specific (this got posted in the DGX forum) - I assume you can replicate it on non-DGX systems?

This sounds like something best submitted as an issue on the DCGM GitHub repository (NVIDIA/DCGM). Would you want to make an issue there?
