Hello, I submitted this issue in GitHub as well (CentOS 7.8 Support - GLIBC_2.27 · Issue #72 · NVIDIA/gpu-operator · GitHub). Posting here also:
1. Quick Debug Checklist
- Are you running on an Ubuntu 18.04 node?
  – No - CentOS 7.8
- Are you running Kubernetes v1.13+?
  – Yes - v1.18.6
- Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  – Yes - v19.03.12
- Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  – No - apparently this is N/A now
- Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
1. Issue or feature description
I'm trying to run on CentOS 7.8 with a single-node Kubernetes cluster that I set up with kubeadm (no OpenShift).
I'm getting errors on the nvidia-driver-validation, nvidia-device-plugin-daemonset, and nvidia-dcgm-exporter pods, each failing with "GLIBC_2.27 not found". It seems to be trying to use the host glibc, which on CentOS 7 is GLIBC 2.17.
From looking at the commits, it seems that CentOS support is a recent development, and perhaps there is some flag or configuration I need to provide to run on CentOS 7 that hasn't been documented yet.
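To illustrate the mismatch, here is a quick host-side check; this is just a sketch of what I mean, assuming ldd and objdump (binutils) are available on the node, and using the library path from the full error message below:
# Host glibc version - CentOS 7 ships 2.17
ldd --version | head -n1
# Symbol versions required by the library named in the error message
objdump -T /usr/local/nvidia/toolkit/libnvidia-container.so.1 | grep -o 'GLIBC_[0-9.]*' | sort -u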
Any ideas? Thank you!
2. Steps to reproduce the issue
- Start with a clean install of CentOS 7.8
- Install Docker, initialize the Kubernetes cluster
- Set up Helm and the NVIDIA repo
- helm install --devel nvidia/gpu-operator --wait --generate-name
All three pods (nvidia-driver-validation, nvidia-device-plugin-daemonset, and nvidia-dcgm-exporter) fail with:
Error: failed to start container "cuda-vector-add": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: /usr/local/nvidia/toolkit/nvidia-container-cli.real: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /usr/local/nvidia/toolkit/libnvidia-container.so.1)\\\\n\\\"\"": unknown
Back-off restarting failed container
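As an additional sanity check (my own suggestion, not part of the issue template), the same failure can presumably be reproduced outside of Kubernetes by running the installed CLI wrapper directly on the host, since it ends up loading the host's /lib64/libc.so.6:
# Invoking the toolkit CLI on the CentOS 7 host should fail with the same
# "GLIBC_2.27 not found" message as the prestart hook does
/usr/local/nvidia/toolkit/nvidia-container-cli info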
3. Information to attach (optional if deemed irrelevant)
- Kubernetes pods status (kubectl get pods --all-namespaces):
$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
gpu-operator-resources nvidia-container-toolkit-daemonset-79427 1/1 Running 0 38m
gpu-operator-resources nvidia-dcgm-exporter-s6gxj 0/1 CrashLoopBackOff 12 38m
gpu-operator-resources nvidia-device-plugin-daemonset-bpwt7 0/1 CrashLoopBackOff 12 38m
gpu-operator-resources nvidia-device-plugin-validation 0/1 Pending 0 38m
gpu-operator-resources nvidia-driver-daemonset-kqvdt 0/1 CrashLoopBackOff 5 38m
gpu-operator-resources nvidia-driver-validation 0/1 CrashLoopBackOff 14 38m
kube-system calico-kube-controllers-578894d4cd-wb6bm 1/1 Running 3 43h
kube-system calico-node-j8qmb 1/1 Running 3 43h
kube-system coredns-66bff467f8-28q78 1/1 Running 3 2d18h
kube-system coredns-66bff467f8-dg55z 1/1 Running 3 2d18h
kube-system etcd-pho-test-4.mitre.org 1/1 Running 3 2d18h
kube-system kube-apiserver-pho-test-4.mitre.org 1/1 Running 3 2d18h
kube-system kube-controller-manager-pho-test-4.mitre.org 1/1 Running 4 2d18h
kube-system kube-proxy-vwm9b 1/1 Running 3 2d18h
kube-system kube-scheduler-pho-test-4.mitre.org 1/1 Running 5 2d18h
kube-system metrics-server-f7cdcc99-mkvfb 1/1 Running 3 24h
kubernetes-dashboard dashboard-metrics-scraper-6b4884c9d5-r6prr 1/1 Running 3 43h
kubernetes-dashboard kubernetes-dashboard-7b544877d5-c2b66 1/1 Running 3 43h
photonapi gpu-operator-1597413719-node-feature-discovery-master-f76b2nd78 1/1 Running 0 38m
photonapi gpu-operator-1597413719-node-feature-discovery-worker-t84dc 1/1 Running 0 38m
photonapi gpu-operator-774ff7994c-bptf8 1/1 Running 0 38m
- Kubernetes daemonset status (kubectl get ds --all-namespaces):
$ kubectl get ds --all-namespaces
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-operator-resources nvidia-container-toolkit-daemonset 1 1 1 1 1 feature.node.kubernetes.io/pci-10de.present=true 39m
gpu-operator-resources nvidia-dcgm-exporter 1 1 0 1 0 feature.node.kubernetes.io/pci-10de.present=true 39m
gpu-operator-resources nvidia-device-plugin-daemonset 1 1 0 1 0 feature.node.kubernetes.io/pci-10de.present=true 39m
gpu-operator-resources nvidia-driver-daemonset 1 1 0 1 0 feature.node.kubernetes.io/pci-10de.present=true 39m
kube-system calico-node 1 1 1 1 1 kubernetes.io/os=linux 43h
kube-system kube-proxy 1 1 1 1 1 kubernetes.io/os=linux 2d18h
photonapi gpu-operator-1597413719-node-feature-discovery-worker 1 1 1 1 1 <none> 39m
- If a pod/ds is in an error or pending state: kubectl describe pod -n NAMESPACE POD_NAME. Here is the output for nvidia-driver-validation; the others are all similar:
$ kubectl describe pod -n gpu-operator-resources nvidia-driver-validation
Name: nvidia-driver-validation
Namespace: gpu-operator-resources
Priority: 0
Node: pho-test-4.mitre.org/10.128.210.23
Start Time: Fri, 14 Aug 2020 10:02:09 -0400
Labels: app=nvidia-driver-validation
Annotations: cni.projectcalico.org/podIP: 10.244.202.208/32
cni.projectcalico.org/podIPs: 10.244.202.208/32
Status: Running
IP: 10.244.202.208
IPs:
IP: 10.244.202.208
Controlled By: ClusterPolicy/cluster-policy
Containers:
cuda-vector-add:
Container ID: docker://88a469dcded3a1afc3da4dde9a4b383965ec7e9df809bf7e96f00a1846db5202
Image: nvidia/samples:cuda10.2-vectorAdd
Image ID: docker-pullable://nvidia/samples@sha256:19f202eecbfa5211be8ce0f0dc02deb44a25a7e3584c993b22b1654f3569ffaf
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: ContainerCannotRun
Message: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: /usr/local/nvidia/toolkit/nvidia-container-cli.real: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /usr/local/nvidia/toolkit/libnvidia-container.so.1)\\\\n\\\"\"": unknown
Exit Code: 128
Started: Fri, 14 Aug 2020 10:39:16 -0400
Finished: Fri, 14 Aug 2020 10:39:16 -0400
Ready: False
Restart Count: 14
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-j58hc (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-j58hc:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-j58hc
Optional: false
QoS Class: BestEffort
Node-Selectors: feature.node.kubernetes.io/pci-10de.present=true
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
nvidia.com/gpu:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 41m default-scheduler Successfully assigned gpu-operator-resources/nvidia-driver-validation to pho-test-4.mitre.org
Normal Pulled 39m (x5 over 41m) kubelet, pho-test-4.mitre.org Container image "nvidia/samples:cuda10.2-vectorAdd" already present on machine
Normal Created 39m (x5 over 41m) kubelet, pho-test-4.mitre.org Created container cuda-vector-add
Warning Failed 39m (x5 over 41m) kubelet, pho-test-4.mitre.org Error: failed to start container "cuda-vector-add": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: /usr/local/nvidia/toolkit/nvidia-container-cli.real: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /usr/local/nvidia/toolkit/libnvidia-container.so.1)\\\\n\\\"\"": unknown
Warning BackOff 86s (x178 over 40m) kubelet, pho-test-4.mitre.org Back-off restarting failed container
- Output of running a container on the GPU machine (docker run -it alpine echo foo):
$ docker run -it alpine echo foo
Unable to find image 'alpine:latest' locally
latest: Pulling from library/alpine
df20fa9351a1: Pull complete
Digest: sha256:185518070891758909c9f839cf4ca393ee977ac378609f700f60a771a2dfe321
Status: Downloaded newer image for alpine:latest
foo
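The alpine container works, presumably because it does not request a GPU, so the NVIDIA hook is a no-op for it. As a further data point (again my own suggestion), running the CUDA sample image directly with Docker should hit the same hook failure, which would confirm the problem is in the toolkit install rather than in Kubernetes:
# With "default-runtime": "nvidia" (see daemon.json below), this goes through the
# nvidia-container-cli prestart hook and should reproduce the GLIBC_2.27 error
docker run --rm nvidia/samples:cuda10.2-vectorAdd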
- Docker configuration file (cat /etc/docker/daemon.json):
$ cat /etc/docker/daemon.json
{
    "exec-opts": [
        "native.cgroupdriver=systemd"
    ],
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m"
    },
    "storage-driver": "overlay2",
    "storage-opts": [
        "overlay2.override_kernel_check=true"
    ],
    "runtimes": {
        "nvidia": {
            "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime"
        }
    },
    "default-runtime": "nvidia"
}
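For completeness, the runtime registration above can be double-checked with docker info (a standard Docker command, included here only as a suggested verification):
# Should list "nvidia" among the runtimes and show it as the default runtime
docker info 2>/dev/null | grep -i -A3 runtime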
- NVIDIA shared directory (ls -la /run/nvidia):
ls -la /run/nvidia
total 8
drwxr-xr-x. 2 root root 80 Aug 14 10:42 .
drwxr-xr-x. 32 root root 980 Aug 14 02:06 ..
-rw-r--r--. 1 root root 5 Aug 14 10:42 nvidia-driver.pid
-rw-r--r--. 1 root root 5 Aug 14 10:02 toolkit.pid
- NVIDIA packages directory (ls -la /usr/local/nvidia/toolkit):
ls -la /usr/local/nvidia/toolkit
total 3992
drwxr-xr-x. 3 root root 4096 Aug 14 10:02 .
drwxr-xr-x. 3 root root 21 Aug 14 10:02 ..
drwxr-xr-x. 3 root root 38 Aug 14 10:02 .config
lrwxrwxrwx. 1 root root 30 Aug 14 10:02 libnvidia-container.so.1 -> ./libnvidia-container.so.1.0.7
-rwxr-xr-x. 1 root root 151088 Aug 14 10:02 libnvidia-container.so.1.0.7
-rwxr-xr-x. 1 root root 154 Aug 14 10:02 nvidia-container-cli
-rwxr-xr-x. 1 root root 34832 Aug 14 10:02 nvidia-container-cli.real
-rwxr-xr-x. 1 root root 166 Aug 14 10:02 nvidia-container-runtime
lrwxrwxrwx. 1 root root 26 Aug 14 10:02 nvidia-container-runtime-hook -> ./nvidia-container-toolkit
-rwxr-xr-x. 1 root root 2008936 Aug 14 10:02 nvidia-container-runtime.real
-rwxr-xr-x. 1 root root 195 Aug 14 10:02 nvidia-container-toolkit
-rwxr-xr-x. 1 root root 1871848 Aug 14 10:02 nvidia-container-toolkit.real
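Given their tiny sizes, nvidia-container-cli (154 bytes), nvidia-container-runtime (166 bytes), and nvidia-container-toolkit (195 bytes) look like wrapper scripts around the corresponding .real binaries; that is my reading of the listing, not something stated in the docs. Inspecting the wrapper and resolving the .real binary's libraries should show exactly where the GLIBC_2.27 requirement comes from:
# The wrapper presumably just sets up the library path and execs the .real binary
cat /usr/local/nvidia/toolkit/nvidia-container-cli
# Resolving the .real binary against the bundled library should report the
# missing GLIBC_2.27 symbol version directly
LD_LIBRARY_PATH=/usr/local/nvidia/toolkit ldd /usr/local/nvidia/toolkit/nvidia-container-cli.real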
- NVIDIA driver directory (ls -la /run/nvidia/driver):
ls -la /run/nvidia/driver
ls: cannot access /run/nvidia/driver: No such file or directory
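The missing /run/nvidia/driver directory is consistent with the nvidia-driver-daemonset pod also sitting in CrashLoopBackOff above. Its logs (pod name taken from the kubectl get pods output) should show whether the driver container itself is failing on CentOS 7 for the same reason or a different one:
# Driver daemonset pod name from the listing above
kubectl logs -n gpu-operator-resources nvidia-driver-daemonset-kqvdt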