Hi NVIDIA experts,
I use the golang DCGM API to watch fields and then update their values through the Fields API.
nv-hostengine is launched as a standalone process in my setup, and the DCGM client connects to it over a TCP connection using the default port and “socketAddress == localhost”. Both the DCGM client and nv-hostengine run on the same host.
Everything works fine, but after 3 months every invocation of dcgm.UpdateAllFields() starts to fail with a “Timeout” error.
Restarting the DCGM client along with nv-hostengine resolves the issue.
The following versions of SW are used:
- datacenter-gpu-manager/unstable,unstable,now 1:2.3.6 amd64
- libnvidia-cfg1-450/unstable,now 450.80.02-0ubuntu1 amd64
- libnvidia-compute-450/unstable,now 450.80.02-0ubuntu1 amd64
- libnvidia-nscq-450/unstable,unstable,now 450.80.02-1 amd64
- cat /sys/module/nvidia/version: 450.80.02
I have the following questions:
- Shall we consider using a UNIX domain socket instead of a TCP connection between the DCGM client and nv-hostengine?
- Could you explain why DCGM_ST_TIMEOUT could be returned from dcgmUpdateAllFields, please? Can it be related to broken TCP connectivity? If so, maybe you should consider reconnecting to nv-hostengine automatically; otherwise the client needs to close and recreate the dcgmHandle_t handle.
- By the way, DCGM_ST_TIMEOUT isn’t a documented return value of dcgmUpdateAllFields. Shall you document it?