I am attempting to build dcgm-exporter to run on bare metal on Ubuntu Bionic. The NVIDIA drivers and CUDA library are installed from the NVIDIA PPA. I followed the build instructions in the README to build the binary.
When I attempt to run the dcgm-exporter binary that I built, the following is the output (from testing):
# /usr/bin/dcgm-exporter
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Pipeline starting
INFO[0000] Starting webserver
ERRO[0002] Failed to collect metrics with error: Failed to collect metrics with error: Error getting device information: API version mismatch
ERRO[0004] Failed to collect metrics with error: Failed to collect metrics with error: Error getting device information: API version mismatch
ERRO[0006] Failed to collect metrics with error: Failed to collect metrics with error: Error getting device information: API version mismatch
All of the NVIDIA tools and scripts we use to test the GPUs work properly; dcgm-exporter is the only tool giving this error. Any assistance would be greatly appreciated. We have Tesla K80 cards and we’re running the nvidia-440.100 driver. I’m using a build from the latest master branch of the dcgm-exporter repo.
# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 440.100 Fri May 29 08:45:51 UTC 2020
GCC version: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
Unfortunately, when attempting to get the dcgm-exporter version, the output is:
# ~/dcgm-exporter --version
DCGM Exporter version Filled by the build system
We successfully use dcgm-exporter 1.7.2 in Kubernetes clusters running on the same hardware. My requirement here is to get the exporter working on bare metal. However, I can’t find a 1.7.2 version in the repo.
I’m at a loss here. Any pointers would be greatly appreciated.