Hello,
The DCGM exporter container is in a permanent CrashLoopBackOff.
The A100 is installed in an ESXi server and passed through to a VM. That VM is a node in a K8s cluster.
The K8s cluster is running v1.28.6. Container log:
2024/11/28 08:45:24 maxprocs: Leaving GOMAXPROCS=64: CPU quota undefined
time="2024-11-28T08:45:24Z" level=info msg="Starting dcgm-exporter"
time="2024-11-28T08:45:24Z" level=info msg="DCGM successfully initialized!"
time="2024-11-28T08:45:24Z" level=info msg="Collecting DCP Metrics"
time="2024-11-28T08:45:24Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-11-28T08:45:24Z" level=info msg="Initializing system entities of type: GPU"
time="2024-11-28T08:45:25Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-11-28T08:45:25Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-11-28T08:45:25Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-11-28T08:45:25Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-11-28T08:45:25Z" level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"
Any tips on this issue would be greatly appreciated.