dcgmUpdateAllFields returns "Timeout"

Hi NVIDIA experts,

I use the Go DCGM API to watch fields and then update their values through the Fields API.
nv-hostengine is launched as a standalone process in my setup, and the DCGM client connects to it over TCP using the default port and “socketAddress == localhost”. Both the DCGM client and nv-hostengine run on the same host.

Everything works fine, but after 3 months days every invocation of dcgm.UpdateAllFields() starts to fail with a “Timeout” error.
Restarting the DCGM client along with nv-hostengine resolves the issue.

The following versions of SW are used:

  • datacenter-gpu-manager/unstable,unstable,now 1:2.3.6 amd64
  • libnvidia-cfg1-450/unstable,now 450.80.02-0ubuntu1 amd64
  • libnvidia-compute-450/unstable,now 450.80.02-0ubuntu1 amd64
  • libnvidia-nscq-450/unstable,unstable,now 450.80.02-1 amd64
  • cat /sys/module/nvidia/version - 450.80.02

I have the following questions:

  • Shall we consider using UNIX domain socket instead of TCP-connection between DCGM client and nv-hostengine?
  • Could you explain why DCGM_ST_TIMEOUT could be returned from dcgmUpdateAllFields, please? Can it be related to broken TCP connectivity? If so, maybe you should consider reconnecting to nv-hostengine automatically; otherwise, the client has to close and recreate the dcgmHandle_t handle.
  • By the way, DCGM_ST_TIMEOUT isn’t a documented return value of dcgmUpdateAllFields. Will you document it?
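
As a client-side stopgap for the second question, the close-and-recreate idea can be sketched in Go as below. This is only an illustration under assumptions, not DCGM's documented recovery path: `fieldUpdater`, `updateWithReconnect`, `errTimeout` and the three hooks are hypothetical names. In a real client, `update` would presumably wrap dcgm.UpdateAllFields(), `reconnect` would release the old handle, initialize the library again and re-create the field watches, and `isTimeout` would recognize whatever error the bindings return for DCGM_ST_TIMEOUT. Since the restart described above included nv-hostengine itself, a client-only reconnect may not be enough.

```go
package main

import (
	"errors"
	"fmt"
)

// errTimeout stands in for the "Timeout" error seen by the client (assumption).
var errTimeout = errors.New("Timeout")

// fieldUpdater bundles hypothetical hooks around the DCGM client.
type fieldUpdater struct {
	update    func() error     // e.g. a wrapper around dcgm.UpdateAllFields()
	reconnect func() error     // e.g. drop the old handle, reconnect, re-create watches
	isTimeout func(error) bool // decides whether an error means DCGM_ST_TIMEOUT
}

// updateWithReconnect runs update once. While the result is a timeout and
// fewer than maxReconnects reconnects have been made, it reconnects and retries.
// Any non-timeout error is returned immediately without reconnecting.
func (u fieldUpdater) updateWithReconnect(maxReconnects int) error {
	err := u.update()
	for attempt := 0; attempt < maxReconnects && err != nil && u.isTimeout(err); attempt++ {
		if rerr := u.reconnect(); rerr != nil {
			return fmt.Errorf("reconnect after timeout failed: %w", rerr)
		}
		err = u.update()
	}
	return err
}

func main() {
	// Simulated connection that times out until it is re-established.
	stale := true
	u := fieldUpdater{
		update: func() error {
			if stale {
				return errTimeout
			}
			return nil
		},
		reconnect: func() error { stale = false; return nil },
		isTimeout: func(err error) bool { return errors.Is(err, errTimeout) },
	}
	fmt.Println("update error:", u.updateWithReconnect(1)) // prints "update error: <nil>"
}
```

Capping the number of reconnects keeps a permanently broken host engine from turning every update into an endless retry loop; the final error is still returned to the caller.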

Hi @dmitrygx ,

This doesn’t look DGX-specific (it was posted in the DGX forum). I assume you can replicate it on non-DGX systems?

This sounds like something best submitted as an issue on the DCGM GitHub repository (NVIDIA/DCGM). Would you want to open an issue there?