GPU Operator Support on CentOS 7.8 - GLIBC_2.27 not found

Hello, I also submitted this issue on GitHub (https://github.com/NVIDIA/gpu-operator/issues/72). Posting it here as well:

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
    – No - CentOS 7.8
  • Are you running Kubernetes v1.13+?
    – Yes - v1.18.6
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
    – Yes - v19.03.12
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
    – No - apparently this is N/A now
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?

1. Issue or feature description

I’m trying to run the GPU Operator on CentOS 7.8 with a single-node Kubernetes cluster that I set up with kubeadm (no OpenShift).

I’m getting errors on the nvidia-driver-validation, nvidia-device-plugin-daemonset, and nvidia-dcgm-exporter pods, each complaining that “GLIBC_2.27 not found”. It seems the toolkit binaries are being run against the host glibc, which on CentOS 7 is GLIBC_2.17.
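To confirm the mismatch, one option is to compare the glibc version the host provides with the newest GLIBC symbol version the toolkit library asks for. The following is a minimal sketch, assuming GNU coreutils (`sort -V`), `getconf`, and binutils (`objdump`) are available on the host; the function names are hypothetical and not part of any NVIDIA tooling.

```shell
#!/usr/bin/env bash
# Sketch: compare the glibc a host provides with the newest GLIBC_x.y symbol
# version a shared object requires. All function names are hypothetical.

# Read `objdump -T <lib>` output (or any text) on stdin and print the highest
# GLIBC_x.y version mentioned, without the "GLIBC_" prefix.
max_glibc_required() {
  grep -o 'GLIBC_[0-9][0-9.]*' | sed 's/^GLIBC_//' | sort -V | tail -n 1
}

# Succeed if version $1 is lower than or equal to version $2.
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Print the host glibc version (the second field of "glibc 2.17").
host_glibc() {
  getconf GNU_LIBC_VERSION | awk '{print $2}'
}

# Report whether library $1 can be satisfied by the host glibc.
check_lib() {
  local need have
  need="$(objdump -T "$1" | max_glibc_required)"
  have="$(host_glibc)"
  if version_le "$need" "$have"; then
    echo "ok: $1 needs GLIBC_$need, host has $have"
  else
    echo "mismatch: $1 needs GLIBC_$need, host has $have"
  fi
}
```

After sourcing the file, it could be run as `check_lib /usr/local/nvidia/toolkit/libnvidia-container.so.1`. Given the error above, I would expect it to report a mismatch on this node, but I have not verified the exact output.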

From looking at the commits it seems that CentOS support is a recent development and perhaps there is some flag or configuration that I need to provide to run on CentOS 7 that hasn’t been documented yet.

Any ideas? Thank you!

2. Steps to reproduce the issue

  • Start with a clean install of CentOS 7.8
  • Install Docker and initialize the Kubernetes cluster
  • Set up Helm and the NVIDIA repo
  • helm install --devel nvidia/gpu-operator --wait --generate-name

Error on all three pods (nvidia-driver-validation, nvidia-device-plugin-daemonset, and nvidia-dcgm-exporter):

Error: failed to start container "cuda-vector-add": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: /usr/local/nvidia/toolkit/nvidia-container-cli.real: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /usr/local/nvidia/toolkit/libnvidia-container.so.1)\\\\n\\\"\"": unknown
Back-off restarting failed container
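Since the same message appears on several pods, it can help to reduce it to the two facts that matter: which symbol version is missing and which library asks for it. Below is a small sketch that does this with `sed`. It assumes the message uses the plain backtick and apostrophe quoting that `kubectl describe` prints, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: read log or event text on stdin and, for each line containing a
# glibc "version ... not found" error, print
#   <missing symbol version> <library that requires it>
# The function name is hypothetical.
missing_glibc_symbol() {
  sed -n "s/.*version \`\(GLIBC_[0-9.]*\)' not found (required by \([^)]*\)).*/\1 \2/p"
}
```

For example, `kubectl describe pod -n gpu-operator-resources nvidia-driver-validation | missing_glibc_symbol` would print one line per matching message in the describe output.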

3. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods --all-namespaces
$ kubectl get pods --all-namespaces
NAMESPACE                NAME                                                              READY   STATUS             RESTARTS   AGE
gpu-operator-resources   nvidia-container-toolkit-daemonset-79427                          1/1     Running            0          38m
gpu-operator-resources   nvidia-dcgm-exporter-s6gxj                                        0/1     CrashLoopBackOff   12         38m
gpu-operator-resources   nvidia-device-plugin-daemonset-bpwt7                              0/1     CrashLoopBackOff   12         38m
gpu-operator-resources   nvidia-device-plugin-validation                                   0/1     Pending            0          38m
gpu-operator-resources   nvidia-driver-daemonset-kqvdt                                     0/1     CrashLoopBackOff   5          38m
gpu-operator-resources   nvidia-driver-validation                                          0/1     CrashLoopBackOff   14         38m
kube-system              calico-kube-controllers-578894d4cd-wb6bm                          1/1     Running            3          43h
kube-system              calico-node-j8qmb                                                 1/1     Running            3          43h
kube-system              coredns-66bff467f8-28q78                                          1/1     Running            3          2d18h
kube-system              coredns-66bff467f8-dg55z                                          1/1     Running            3          2d18h
kube-system              etcd-pho-test-4.mitre.org                                         1/1     Running            3          2d18h
kube-system              kube-apiserver-pho-test-4.mitre.org                               1/1     Running            3          2d18h
kube-system              kube-controller-manager-pho-test-4.mitre.org                      1/1     Running            4          2d18h
kube-system              kube-proxy-vwm9b                                                  1/1     Running            3          2d18h
kube-system              kube-scheduler-pho-test-4.mitre.org                               1/1     Running            5          2d18h
kube-system              metrics-server-f7cdcc99-mkvfb                                     1/1     Running            3          24h
kubernetes-dashboard     dashboard-metrics-scraper-6b4884c9d5-r6prr                        1/1     Running            3          43h
kubernetes-dashboard     kubernetes-dashboard-7b544877d5-c2b66                             1/1     Running            3          43h
photonapi                gpu-operator-1597413719-node-feature-discovery-master-f76b2nd78   1/1     Running            0          38m
photonapi                gpu-operator-1597413719-node-feature-discovery-worker-t84dc       1/1     Running            0          38m
photonapi                gpu-operator-774ff7994c-bptf8                                     1/1     Running            0          38m
  • kubernetes daemonset status: kubectl get ds --all-namespaces
$ kubectl get ds --all-namespaces
NAMESPACE                NAME                                                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-operator-resources   nvidia-container-toolkit-daemonset                      1         1         1       1            1           feature.node.kubernetes.io/pci-10de.present=true   39m
gpu-operator-resources   nvidia-dcgm-exporter                                    1         1         0       1            0           feature.node.kubernetes.io/pci-10de.present=true   39m
gpu-operator-resources   nvidia-device-plugin-daemonset                          1         1         0       1            0           feature.node.kubernetes.io/pci-10de.present=true   39m
gpu-operator-resources   nvidia-driver-daemonset                                 1         1         0       1            0           feature.node.kubernetes.io/pci-10de.present=true   39m
kube-system              calico-node                                             1         1         1       1            1           kubernetes.io/os=linux                             43h
kube-system              kube-proxy                                              1         1         1       1            1           kubernetes.io/os=linux                             2d18h
photonapi                gpu-operator-1597413719-node-feature-discovery-worker   1         1         1       1            1           <none>                                             39m
  • If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME

Here is the output for nvidia-driver-validation; the other failing pods show similar output:

$ kubectl describe pod -n gpu-operator-resources nvidia-driver-validation
Name:         nvidia-driver-validation
Namespace:    gpu-operator-resources
Priority:     0
Node:         pho-test-4.mitre.org/10.128.210.23
Start Time:   Fri, 14 Aug 2020 10:02:09 -0400
Labels:       app=nvidia-driver-validation
Annotations:  cni.projectcalico.org/podIP: 10.244.202.208/32
              cni.projectcalico.org/podIPs: 10.244.202.208/32
Status:       Running
IP:           10.244.202.208
IPs:
  IP:           10.244.202.208
Controlled By:  ClusterPolicy/cluster-policy
Containers:
  cuda-vector-add:
    Container ID:   docker://88a469dcded3a1afc3da4dde9a4b383965ec7e9df809bf7e96f00a1846db5202
    Image:          nvidia/samples:cuda10.2-vectorAdd
    Image ID:       docker-pullable://nvidia/samples@sha256:19f202eecbfa5211be8ce0f0dc02deb44a25a7e3584c993b22b1654f3569ffaf
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: /usr/local/nvidia/toolkit/nvidia-container-cli.real: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /usr/local/nvidia/toolkit/libnvidia-container.so.1)\\\\n\\\"\"": unknown
      Exit Code:    128
      Started:      Fri, 14 Aug 2020 10:39:16 -0400
      Finished:     Fri, 14 Aug 2020 10:39:16 -0400
    Ready:          False
    Restart Count:  14
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-j58hc (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-j58hc:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-j58hc
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  feature.node.kubernetes.io/pci-10de.present=true
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
                 nvidia.com/gpu:NoSchedule
Events:
  Type     Reason     Age                  From                           Message
  ----     ------     ----                 ----                           -------
  Normal   Scheduled  41m                  default-scheduler              Successfully assigned gpu-operator-resources/nvidia-driver-validation to pho-test-4.mitre.org
  Normal   Pulled     39m (x5 over 41m)    kubelet, pho-test-4.mitre.org  Container image "nvidia/samples:cuda10.2-vectorAdd" already present on machine
  Normal   Created    39m (x5 over 41m)    kubelet, pho-test-4.mitre.org  Created container cuda-vector-add
  Warning  Failed     39m (x5 over 41m)    kubelet, pho-test-4.mitre.org  Error: failed to start container "cuda-vector-add": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: /usr/local/nvidia/toolkit/nvidia-container-cli.real: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /usr/local/nvidia/toolkit/libnvidia-container.so.1)\\\\n\\\"\"": unknown
  Warning  BackOff    86s (x178 over 40m)  kubelet, pho-test-4.mitre.org  Back-off restarting failed container
  • Output of running a container on the GPU machine: docker run -it alpine echo foo
$ docker run -it alpine echo foo
Unable to find image 'alpine:latest' locally
latest: Pulling from library/alpine
df20fa9351a1: Pull complete 
Digest: sha256:185518070891758909c9f839cf4ca393ee977ac378609f700f60a771a2dfe321
Status: Downloaded newer image for alpine:latest
foo
  • Docker configuration file: cat /etc/docker/daemon.json
$ cat /etc/docker/daemon.json
{
  "exec-opts": [
    "native.cgroupdriver=systemd"
  ],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2",
  "storage-opts": [
    "overlay2.override_kernel_check=true"
  ],
  "runtimes": {
    "nvidia": {
      "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime"
    }
  },
  "default-runtime": "nvidia"
}
  • NVIDIA shared directory: ls -la /run/nvidia
$ ls -la /run/nvidia
total 8
drwxr-xr-x.  2 root root  80 Aug 14 10:42 .
drwxr-xr-x. 32 root root 980 Aug 14 02:06 ..
-rw-r--r--.  1 root root   5 Aug 14 10:42 nvidia-driver.pid
-rw-r--r--.  1 root root   5 Aug 14 10:02 toolkit.pid
  • NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit
$ ls -la /usr/local/nvidia/toolkit
total 3992
drwxr-xr-x. 3 root root    4096 Aug 14 10:02 .
drwxr-xr-x. 3 root root      21 Aug 14 10:02 ..
drwxr-xr-x. 3 root root      38 Aug 14 10:02 .config
lrwxrwxrwx. 1 root root      30 Aug 14 10:02 libnvidia-container.so.1 -> ./libnvidia-container.so.1.0.7
-rwxr-xr-x. 1 root root  151088 Aug 14 10:02 libnvidia-container.so.1.0.7
-rwxr-xr-x. 1 root root     154 Aug 14 10:02 nvidia-container-cli
-rwxr-xr-x. 1 root root   34832 Aug 14 10:02 nvidia-container-cli.real
-rwxr-xr-x. 1 root root     166 Aug 14 10:02 nvidia-container-runtime
lrwxrwxrwx. 1 root root      26 Aug 14 10:02 nvidia-container-runtime-hook -> ./nvidia-container-toolkit
-rwxr-xr-x. 1 root root 2008936 Aug 14 10:02 nvidia-container-runtime.real
-rwxr-xr-x. 1 root root     195 Aug 14 10:02 nvidia-container-toolkit
-rwxr-xr-x. 1 root root 1871848 Aug 14 10:02 nvidia-container-toolkit.real
  • NVIDIA driver directory: ls -la /run/nvidia/driver
$ ls -la /run/nvidia/driver
ls: cannot access /run/nvidia/driver: No such file or directory
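
For anyone trying to reproduce this, the pod listing earlier in this section can be narrowed down to the unhealthy pods with a short filter. This is only a sketch: it assumes the default column layout of `kubectl get pods --all-namespaces` (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE), and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: read `kubectl get pods --all-namespaces` output on stdin and print
# "namespace/name status restarts" for every pod whose STATUS is neither
# Running nor Completed. The header row is skipped.
# The function name is hypothetical.
unhealthy_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4, $5 }'
}
```

Usage would be `kubectl get pods --all-namespaces | unhealthy_pods`. Applied to the listing above, it would flag the five gpu-operator-resources pods that are in CrashLoopBackOff or Pending, including nvidia-driver-daemonset.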