DCGM exporter 3.6.0 - cannot gather metrics from the GA100 GPU (A100 80GB)

adrian.jastrzebski · November 28, 2024, 9:01am

Hello,

The DCGM exporter container is stuck in a permanent CrashLoopBackOff.

The A100 is installed in an ESXi server and passed through to a VM. That VM is a node in a K8s cluster.
The K8s cluster runs v1.28.6.

2024/11/28 08:45:24 maxprocs: Leaving GOMAXPROCS=64: CPU quota undefined
time="2024-11-28T08:45:24Z" level=info msg="Starting dcgm-exporter"
time="2024-11-28T08:45:24Z" level=info msg="DCGM successfully initialized!"
time="2024-11-28T08:45:24Z" level=info msg="Collecting DCP Metrics"
time="2024-11-28T08:45:24Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-11-28T08:45:24Z" level=info msg="Initializing system entities of type: GPU"
time="2024-11-28T08:45:25Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-11-28T08:45:25Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-11-28T08:45:25Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-11-28T08:45:25Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-11-28T08:45:25Z" level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"

Any tips on this issue would be greatly appreciated.
