Completely purge and reinstall the NVIDIA GPU Operator

Thanks a lot @Morganh!! I highly appreciate the level of support!!

I followed the steps in the file you uploaded; please check below!

Pre-checks

listing the chart

g@gsrv:~$ helm ls -n gpu-operator
NAME                    NAMESPACE     REVISION  UPDATED                                 STATUS    CHART                 APP VERSION
gpu-operator-1692720059 gpu-operator  1         2023-08-22 16:01:04.446737327 +0000 UTC deployed  gpu-operator-v23.6.0  v23.6.0 

checking pods

g@gsrv:~$  kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS                  RESTARTS        AGE
gpu-feature-discovery-hg6vw                                       0/1     Init:0/1                0               16h
gpu-operator-1692720059-node-feature-discovery-master-74b78zmhw   1/1     Running                 0               16h
gpu-operator-1692720059-node-feature-discovery-worker-7lqbt       1/1     Running                 0               16h
gpu-operator-1692720059-node-feature-discovery-worker-rxqvv       1/1     Running                 0               16h
gpu-operator-7b8668c994-kccdk                                     1/1     Running                 0               16h
nvidia-dcgm-exporter-w58h4                                        0/1     Init:0/1                0               16h
nvidia-device-plugin-daemonset-mc9fp                              0/1     Init:0/1                0               16h
nvidia-mig-manager-rvf28                                          0/1     Init:0/1                0               16h
nvidia-operator-validator-vt9r2                                   0/1     Init:CrashLoopBackOff   193 (69s ago)   16h

checking clusterroles

g@gsrv:~$  kubectl get clusterroles | grep gpu
gpu-operator                                                           2023-08-22T16:01:07Z
gpu-operator-1692635184-node-feature-discovery                         2023-08-21T16:26:30Z   (Note: could this be the cause, since we have two of these?)
gpu-operator-1692720059-node-feature-discovery                         2023-08-22T16:01:07Z
nvidia-gpu-feature-discovery                                           2023-08-22T16:01:26Z

checking clusterrolebindings

g@gsrv:~$ kubectl get clusterrolebinding | grep gpu
gpu-operator                                                      ClusterRole/gpu-operator                                                           16h
gpu-operator-1692720059-node-feature-discovery                    ClusterRole/gpu-operator-1692720059-node-feature-discovery                         16h
gpu-operator-1692720059-node-feature-discovery-topology-updater   ClusterRole/gpu-operator-1692720059-node-feature-discovery-topology-updater        16h
nvidia-gpu-feature-discovery                                      ClusterRole/nvidia-gpu-feature-discovery                                           16h

checking deployments, daemonsets and crds

g@gsrv:~$ kubectl get deployments -A
NAMESPACE          NAME                                                    READY   UP-TO-DATE   AVAILABLE   AGE
calico-apiserver   calico-apiserver                                        2/2     2            2           13d
calico-system      calico-kube-controllers                                 1/1     1            1           13d
calico-system      calico-typha                                            1/1     1            1           13d
clearml            clearml-apiserver                                       1/1     1            1           12d
clearml            clearml-fileserver                                      1/1     1            1           12d
clearml            clearml-mongodb                                         1/1     1            1           12d
clearml            clearml-webserver                                       1/1     1            1           12d
gpu-operator       gpu-operator                                            1/1     1            1           16h
gpu-operator       gpu-operator-1692720059-node-feature-discovery-master   1/1     1            1           16h
k8-storage         nfs-subdir-external-provisioner                         1/1     1            1           12d
kube-system        coredns                                                 2/2     2            2           13d
nuclio             nuclio-controller                                       1/1     1            1           11d
nuclio             nuclio-dashboard                                        1/1     1            1           11d
nuclio             nuclio-test-nuctl-function-1                            1/1     1            1           5d18h   (Note: this is still working because it only needs CPU)
nuclio             nuclio-test-nuctl-function-2-retinanet                  0/1     1            0           4d23h   (Note: this is not working because it needs a GPU)
tao-gnet           ingress-nginx-controller                                1/1     1            1           41h
tao-gnet           tao-toolkit-api-app-pod                                 1/1     1            1           41h
tao-gnet           tao-toolkit-api-workflow-pod                            1/1     1            1           41h
tigera-operator    tigera-operator                                         1/1     1            1           13d




g@gsrv:~$ kubectl get daemonsets -A
NAMESPACE       NAME                                                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
calico-system   calico-node                                             2         2         2       2            2           kubernetes.io/os=linux                             13d
calico-system   csi-node-driver                                         2         2         2       2            2           kubernetes.io/os=linux                             13d
gpu-operator    gpu-feature-discovery                                   1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   16h
gpu-operator    gpu-operator-1692720059-node-feature-discovery-worker   2         2         2       2            2           <none>                                             16h
gpu-operator    nvidia-dcgm-exporter                                    1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true           16h
gpu-operator    nvidia-device-plugin-daemonset                          1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true           16h
gpu-operator    nvidia-mig-manager                                      1         1         0       1            0           nvidia.com/gpu.deploy.mig-manager=true             16h
gpu-operator    nvidia-operator-validator                               1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true      16h
kube-system     kube-proxy                                              2         2         2       2            2           kubernetes.io/os=linux                             13d


g@gsrv:~$ kubectl get crd 
NAME                                                  CREATED AT
apiservers.operator.tigera.io                         2023-08-09T17:40:28Z
bgpconfigurations.crd.projectcalico.org               2023-08-09T17:40:27Z
bgpfilters.crd.projectcalico.org                      2023-08-09T17:40:28Z
bgppeers.crd.projectcalico.org                        2023-08-09T17:40:28Z
blockaffinities.crd.projectcalico.org                 2023-08-09T17:40:28Z
caliconodestatuses.crd.projectcalico.org              2023-08-09T17:40:28Z
clusterinformations.crd.projectcalico.org             2023-08-09T17:40:28Z
clusterpolicies.nvidia.com                            2023-08-22T16:01:02Z
felixconfigurations.crd.projectcalico.org             2023-08-09T17:40:28Z
globalnetworkpolicies.crd.projectcalico.org           2023-08-09T17:40:28Z
globalnetworksets.crd.projectcalico.org               2023-08-09T17:40:28Z
hostendpoints.crd.projectcalico.org                   2023-08-09T17:40:28Z
imagesets.operator.tigera.io                          2023-08-09T17:40:28Z
installations.operator.tigera.io                      2023-08-09T17:40:28Z
ipamblocks.crd.projectcalico.org                      2023-08-09T17:40:28Z
ipamconfigs.crd.projectcalico.org                     2023-08-09T17:40:28Z
ipamhandles.crd.projectcalico.org                     2023-08-09T17:40:28Z
ippools.crd.projectcalico.org                         2023-08-09T17:40:28Z
ipreservations.crd.projectcalico.org                  2023-08-09T17:40:28Z
kubecontrollersconfigurations.crd.projectcalico.org   2023-08-09T17:40:28Z
networkpolicies.crd.projectcalico.org                 2023-08-09T17:40:28Z
networksets.crd.projectcalico.org                     2023-08-09T17:40:28Z
nodefeaturerules.nfd.k8s-sigs.io                      2023-08-22T09:50:50Z
nodefeatures.nfd.k8s-sigs.io                          2023-08-09T17:53:28Z
nuclioapigateways.nuclio.io                           2023-08-11T11:08:55Z
nucliofunctionevents.nuclio.io                        2023-08-11T11:08:55Z
nucliofunctions.nuclio.io                             2023-08-11T11:08:55Z
nuclioprojects.nuclio.io                              2023-08-11T11:08:55Z
tigerastatuses.operator.tigera.io                     2023-08-09T17:40:29Z

Deleting

finding the generated chart name

g@gsrv:~$ helm list -n gpu-operator | grep gpu | awk '{print $1}'
gpu-operator-1692720059

deleting the chart

g@gsrv:~$ helm delete gpu-operator-1692720059
Error: uninstall: Release not loaded: gpu-operator-1692720059: release: not found
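
(In hindsight, this first delete probably failed because Helm 3 releases are namespace-scoped, so without -n Helm looks in the default namespace. A quick way to list every release in every namespace, including failed or pending ones:)

helm list -A -a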

Because I saw artifacts from gpu-operator-1692635184, I tried deleting it again

helm delete  gpu-operator-1692635184 -n gpu-operator                                   
Error: uninstall: Release not loaded: gpu-operator-1692635184: release: not found

deleting the stray clusterrole I found earlier to make sure it is all gone.

g@gsrv:~$ kubectl delete clusterroles gpu-operator-1692635184-node-feature-discovery
clusterrole.rbac.authorization.k8s.io "gpu-operator-1692635184-node-feature-discovery" deleted
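
(To catch anything else left behind by the old release, a broader sweep over the cluster-scoped objects could look something like this — just a sketch, grepping for the old release suffix:)

kubectl get clusterroles,clusterrolebindings | grep 1692635184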

uninstalling the current release, this time with the namespace flag (but using the delete command, as you’ve done)

helm delete gpu-operator-1692720059 -n gpu-operator
release "gpu-operator-1692720059" uninstalled

deleting the CRD

g@gsrv:~$ kubectl delete crd clusterpolicies.nvidia.com
customresourcedefinition.apiextensions.k8s.io "clusterpolicies.nvidia.com" deleted

double-checking that both release names are gone

g@gsrv:~$ helm uninstall --wait  gpu-operator-1692635184 -n gpu-operator
Error: uninstall: Release not loaded: gpu-operator-1692635184: release: not found
g@gsrv:~$ helm uninstall --wait gpu-operator-1692720059 -n gpu-operator
Error: uninstall: Release not loaded: gpu-operator-1692720059: release: not found

Then I even deleted the namespace just to be sure it’s all gone

g@gsrv:~$ kubectl delete namespace gpu-operator
namespace "gpu-operator" deleted

Checking it is all gone.

g@gsrv:~$  helm ls -n gpu-operator
NAME  NAMESPACE REVISION  UPDATED STATUS  CHART APP VERSION


g@gsrv:~$ kubectl get clusterroles | grep gpu
g@gsrv:~$ kubectl get clusterrolebinding | grep gpu
g@gsrv:~$ kubectl get clusterrolebinding | grep nv
g@gsrv:~$ kubectl get clusterroles | grep nv
g@gsrv:~$ kubectl get crd 
NAME                                                  CREATED AT
apiservers.operator.tigera.io                         2023-08-09T17:40:28Z
bgpconfigurations.crd.projectcalico.org               2023-08-09T17:40:27Z
bgpfilters.crd.projectcalico.org                      2023-08-09T17:40:28Z
bgppeers.crd.projectcalico.org                        2023-08-09T17:40:28Z
blockaffinities.crd.projectcalico.org                 2023-08-09T17:40:28Z
caliconodestatuses.crd.projectcalico.org              2023-08-09T17:40:28Z
clusterinformations.crd.projectcalico.org             2023-08-09T17:40:28Z
felixconfigurations.crd.projectcalico.org             2023-08-09T17:40:28Z
globalnetworkpolicies.crd.projectcalico.org           2023-08-09T17:40:28Z
globalnetworksets.crd.projectcalico.org               2023-08-09T17:40:28Z
hostendpoints.crd.projectcalico.org                   2023-08-09T17:40:28Z
imagesets.operator.tigera.io                          2023-08-09T17:40:28Z
installations.operator.tigera.io                      2023-08-09T17:40:28Z
ipamblocks.crd.projectcalico.org                      2023-08-09T17:40:28Z
ipamconfigs.crd.projectcalico.org                     2023-08-09T17:40:28Z
ipamhandles.crd.projectcalico.org                     2023-08-09T17:40:28Z
ippools.crd.projectcalico.org                         2023-08-09T17:40:28Z
ipreservations.crd.projectcalico.org                  2023-08-09T17:40:28Z
kubecontrollersconfigurations.crd.projectcalico.org   2023-08-09T17:40:28Z
networkpolicies.crd.projectcalico.org                 2023-08-09T17:40:28Z
networksets.crd.projectcalico.org                     2023-08-09T17:40:28Z
nodefeaturerules.nfd.k8s-sigs.io                      2023-08-22T09:50:50Z
nodefeatures.nfd.k8s-sigs.io                          2023-08-09T17:53:28Z
nuclioapigateways.nuclio.io                           2023-08-11T11:08:55Z
nucliofunctionevents.nuclio.io                        2023-08-11T11:08:55Z
nucliofunctions.nuclio.io                             2023-08-11T11:08:55Z
nuclioprojects.nuclio.io                              2023-08-11T11:08:55Z
tigerastatuses.operator.tigera.io                     2023-08-09T17:40:29Z

Then I shut down the DGX (to unload the drivers, in case any were lingering).
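
(The reboot is the blunt approach; for future reference, a way to check for lingering driver users on the GPU node without rebooting might be something like this — just a sketch:)

# on the dgx node: see whether the nvidia kernel modules are loaded and their use counts
lsmod | grep ^nvidia

# list any processes that still hold the GPU devices open
sudo lsof /dev/nvidia* 2>/dev/null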

Reinstalling the chart

  1. Turned the DGX back on
  2. Waited for the cluster to settle (approx. 15 minutes), in case it needed time to renegotiate

making sure everything is up and running

g@dgx:~$  kubectl get pods -A 
NAMESPACE          NAME                                                      READY   STATUS    RESTARTS       AGE
calico-apiserver   calico-apiserver-db54b987d-m66zz                          1/1     Running   0              13d
calico-apiserver   calico-apiserver-db54b987d-ncspt                          1/1     Running   0              13d
calico-system      calico-kube-controllers-666f5dcd4d-kj7fs                  1/1     Running   0              13d
calico-system      calico-node-j2ljx                                         1/1     Running   6 (27m ago)    13d
calico-system      calico-node-trx99                                         1/1     Running   0              13d
calico-system      calico-typha-585d9c9df4-x9c6k                             1/1     Running   0              13d
calico-system      csi-node-driver-slh5f                                     2/2     Running   0              13d
calico-system      csi-node-driver-wf8n9                                     2/2     Running   12 (27m ago)   13d
clearml            clearml-apiserver-76ff97d7f7-wcn6v                        1/1     Running   0              21m
clearml            clearml-elastic-master-0                                  1/1     Running   0              15m
clearml            clearml-fileserver-ff756c4b8-fk59x                        1/1     Running   0              21m
clearml            clearml-mongodb-5f9468969b-bmc6s                          1/1     Running   0              21m
clearml            clearml-redis-master-0                                    1/1     Running   0              15m
clearml            clearml-webserver-7f5fb5df5d-qpkbl                        1/1     Running   0              21m
default            cuda-vectoradd                                            0/1     Pending   0              21h
default            gpu-test-job-v2zxg                                        0/1     Pending   0              42h
k8-storage         nfs-subdir-external-provisioner-5669cc5b6-77gz5           1/1     Running   1 (15m ago)    21m
kube-system        coredns-57575c5f89-9flb2                                  1/1     Running   0              13d
kube-system        coredns-57575c5f89-nrd5f                                  1/1     Running   0              13d
kube-system        etcd-gsrv                                                 1/1     Running   0              13d
kube-system        kube-apiserver-gsrv                                       1/1     Running   0              13d
kube-system        kube-controller-manager-gsrv                              1/1     Running   0              13d
kube-system        kube-proxy-tzhrp                                          1/1     Running   6 (27m ago)    13d
kube-system        kube-proxy-z4hxr                                          1/1     Running   0              13d
kube-system        kube-scheduler-gsrv                                       1/1     Running   0              13d
nuclio             nuclio-controller-679c44dcdc-nsmtm                        1/1     Running   0              21m
nuclio             nuclio-dashboard-6496cdfd66-ktnkb                         1/1     Running   1 (15m ago)    21m
nuclio             nuclio-test-nuctl-function-1-84b6bd65bd-wz9hm             1/1     Running   0              21m
nuclio             nuclio-test-nuctl-function-2-retinanet-7d8545d7db-fv87n   0/1     Pending   0              41h
tao-gnet           ingress-nginx-controller-78d54fbd-g9nrm                   1/1     Running   0              21m
tao-gnet           tao-toolkit-api-app-pod-5ffc48cd57-7d4mx                  1/1     Running   0              21m
tao-gnet           tao-toolkit-api-workflow-pod-6dbc7c8f98-5dl8n             1/1     Running   0              21m
tigera-operator    tigera-operator-959786749-ctprw                           1/1     Running   0              13d

Then I reinstalled the chart! (This time I used a new namespace, gpu-operator-nvidia, instead of the previous gpu-operator, to be extra safe.)

helm install --wait --generate-name \
     -n gpu-operator-nvidia --create-namespace \
      nvidia/gpu-operator \
      --set driver.enabled=false \
      --set toolkit.enabled=false

Still no joy!! :(
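
(One sanity check I can do is confirm that the nvidia.com/gpu.deploy.* labels the DaemonSets select on are actually set on the GPU node — just a sketch, with the node name taken from above:)

# list the node labels one per line and keep only the gpu.deploy gates
kubectl get node dgx --show-labels | tr ',' '\n' | grep nvidia.com/gpu.deploy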

Then I looked at the logs:

finding all pods in the namespace
g@gsrv:~$ kubectl get pods -n gpu-operator-nvidia
NAME                                                              READY   STATUS                  RESTARTS        AGE
gpu-feature-discovery-c24v4                                       0/1     Init:0/1                0               18m
gpu-operator-1692786693-node-feature-discovery-master-f947hcgkb   1/1     Running                 0               18m
gpu-operator-1692786693-node-feature-discovery-worker-flhl8       1/1     Running                 0               18m
gpu-operator-1692786693-node-feature-discovery-worker-thvxj       1/1     Running                 0               18m
gpu-operator-5747c5f6db-2cvht                                     1/1     Running                 0               18m
nvidia-dcgm-exporter-zcxbb                                        0/1     Init:0/1                0               18m
nvidia-device-plugin-daemonset-mqdxh                              0/1     Init:0/1                0               18m
nvidia-mig-manager-5n7sl                                          0/1     Init:0/1                0               18m
nvidia-operator-validator-ztgbb                                   0/1     Init:CrashLoopBackOff   8 (2m51s ago)   18m

Then I got logs for all the pods, starting with the running ones.

logs from gpu-operator-1692786693-node-feature-discovery-master-f947hcgkb

g@gsrv:~$ kubectl logs  -n gpu-operator-nvidia gpu-operator-1692786693-node-feature-discovery-master-f947hcgkb
W0823 10:31:41.346829       1 main.go:56] -featurerules-controller is deprecated, use '-crd-controller' flag instead
I0823 10:31:41.347026       1 nfd-master.go:181] Node Feature Discovery Master v0.13.1
I0823 10:31:41.347038       1 nfd-master.go:185] NodeName: "gsrv"
I0823 10:31:41.347048       1 nfd-master.go:186] Kubernetes namespace: "gpu-operator-nvidia"
I0823 10:31:41.347112       1 nfd-master.go:1091] config file "/etc/kubernetes/node-feature-discovery/nfd-master.conf" not found, using defaults
I0823 10:31:41.347303       1 nfd-master.go:1145] master (re-)configuration successfully completed
I0823 10:31:41.347319       1 nfd-master.go:202] starting nfd api controller
I0823 10:31:41.376543       1 component.go:36] [core][Server #1] Server created
I0823 10:31:41.376576       1 nfd-master.go:292] gRPC server serving on port: 8080
I0823 10:31:41.376641       1 component.go:36] [core][Server #1 ListenSocket #2] ListenSocket created
I0823 10:31:42.375973       1 nfd-master.go:601] will process all nodes in the cluster

I'm not sure whether the warning "config file /etc/kubernetes/node-feature-discovery/nfd-master.conf not found, using defaults" is actually a problem, as mentioned in the quote from topic 226781?

I checked the /etc/kubernetes/node-feature-discovery directories on the two nodes.

gsrv (master node)

g@gsrv:~$ tree  /etc/kubernetes/node-feature-discovery
/etc/kubernetes/node-feature-discovery
├── features.d
└── source.d

2 directories, 0 files

dgx (GPU node)

g@dgx:~$ tree  /etc/kubernetes/node-feature-discovery
/etc/kubernetes/node-feature-discovery
├── features.d
└── source.d

2 directories, 0 files

Maybe the nfd-master.conf got deleted, or due to some config error it is looking for a file that never existed?

However, when I check the ConfigMaps, there does seem to be one created with the generated name.

g@gsrv:~$ kubectl get configmap -n gpu-operator-nvidia
NAME                                                                   DATA   AGE
default-gpu-clients                                                    1      70m
default-mig-parted-config                                              1      70m
gpu-operator-1692786693-node-feature-discovery-master-conf             1      70m
gpu-operator-1692786693-node-feature-discovery-topology-updater-conf   1      70m
gpu-operator-1692786693-node-feature-discovery-worker-conf             1      70m
kube-root-ca.crt                                                       1      70m
nvidia-device-plugin-entrypoint                                        1      70m
nvidia-mig-manager-entrypoint                                          1      70m

When I run kubectl edit configmap -n gpu-operator-nvidia gpu-operator-1692786693-node-feature-discovery-master-conf it opens up for editing, so maybe this is not an issue?

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
data:
  nfd-master.conf: |-
    extraLabelNs:
    - nvidia.com
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: gpu-operator-1692786693
    meta.helm.sh/release-namespace: gpu-operator-nvidia
  creationTimestamp: "2023-08-23T10:31:39Z"
  labels:
    app.kubernetes.io/instance: gpu-operator-1692786693
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/version: v0.13.1
    helm.sh/chart: node-feature-discovery-0.13.1
  name: gpu-operator-1692786693-node-feature-discovery-master-conf
  namespace: gpu-operator-nvidia
  resourceVersion: "3422257"
  uid: 37c2b25f-3b04-4028-a8be-f95a353544fc                                     
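
(So the data is there; what I'm less sure about is whether this ConfigMap is what the master pod actually reads, rather than the host path in the warning. A rough way to check, using the pod name from above — sketch only:)

# list the pod's volumes and which ConfigMap, if any, backs each one
kubectl get pod gpu-operator-1692786693-node-feature-discovery-master-f947hcgkb \
    -n gpu-operator-nvidia \
    -o jsonpath='{range .spec.volumes[*]}{.name}{"\t"}{.configMap.name}{"\n"}{end}'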

logs from gpu-operator-1692786693-node-feature-discovery-worker-flhl8

g@gsrv:~$ kubectl logs  -n gpu-operator-nvidia gpu-operator-1692786693-node-feature-discovery-worker-flhl8 
I0823 10:31:41.387093       1 nfd-worker.go:222] Node Feature Discovery Worker v0.13.1
I0823 10:31:41.387150       1 nfd-worker.go:223] NodeName: 'gsrv'
I0823 10:31:41.387158       1 nfd-worker.go:224] Kubernetes namespace: 'gpu-operator-nvidia'
I0823 10:31:41.387779       1 nfd-worker.go:518] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I0823 10:31:41.387911       1 nfd-worker.go:550] worker (re-)configuration successfully completed
I0823 10:31:41.411725       1 local.go:115] starting hooks...
I0823 10:31:41.445933       1 nfd-worker.go:561] starting feature discovery...
I0823 10:31:41.446421       1 nfd-worker.go:573] feature discovery completed
I0823 10:31:41.466469       1 nfd-worker.go:694] creating NodeFeature object "gsrv"
I0823 10:32:41.423353       1 local.go:115] starting hooks...
I0823 10:32:41.484458       1 nfd-worker.go:561] starting feature discovery...
I0823 10:32:41.485156       1 nfd-worker.go:573] feature discovery completed
...
...
I0823 10:49:41.430477       1 local.go:115] starting hooks...
I0823 10:49:41.479448       1 nfd-worker.go:561] starting feature discovery...
I0823 10:49:41.480000       1 nfd-worker.go:573] feature discovery completed
...
...

logs from gpu-operator-1692786693-node-feature-discovery-worker-thvxj

g@gsrv:~$ kubectl logs  -n gpu-operator-nvidia gpu-operator-1692786693-node-feature-discovery-worker-thvxj
I0823 10:31:40.964623       1 nfd-worker.go:222] Node Feature Discovery Worker v0.13.1
I0823 10:31:40.964651       1 nfd-worker.go:223] NodeName: 'dgx'
I0823 10:31:40.964655       1 nfd-worker.go:224] Kubernetes namespace: 'gpu-operator-nvidia'
I0823 10:31:40.966036       1 nfd-worker.go:518] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I0823 10:31:40.966088       1 nfd-worker.go:550] worker (re-)configuration successfully completed
I0823 10:31:40.975638       1 local.go:115] starting hooks...
I0823 10:31:40.993071       1 nfd-worker.go:561] starting feature discovery...
I0823 10:31:40.993439       1 nfd-worker.go:573] feature discovery completed
I0823 10:31:41.010433       1 nfd-worker.go:694] creating NodeFeature object "dgx"
I0823 10:32:41.000681       1 local.go:115] starting hooks...
I0823 10:32:41.017481       1 nfd-worker.go:561] starting feature discovery...
I0823 10:32:41.017801       1 nfd-worker.go:573] feature discovery completed
...
...
I0823 10:40:41.002458       1 local.go:115] starting hooks...
I0823 10:40:41.016863       1 nfd-worker.go:561] starting feature discovery...
I0823 10:40:41.017193       1 nfd-worker.go:573] feature discovery completed
...
...

logs from gpu-operator-5747c5f6db-2cvht

gpu-operator-5747c5f6db-2cvht.txt (4.7 MB)

logs from nvidia-dcgm-exporter-zcxbb

g@gsrv:~$ kubectl logs -n gpu-operator-nvidia nvidia-dcgm-exporter-zcxbb
Defaulted container "nvidia-dcgm-exporter" out of: nvidia-dcgm-exporter, toolkit-validation (init)
Error from server (BadRequest): container "nvidia-dcgm-exporter" in pod "nvidia-dcgm-exporter-zcxbb" is waiting to start: PodInitializing

logs from nvidia-device-plugin-daemonset-mqdxh

g@gsrv:~$ kubectl logs -n gpu-operator-nvidia nvidia-device-plugin-daemonset-mqdxh
Defaulted container "nvidia-device-plugin" out of: nvidia-device-plugin, toolkit-validation (init)
Error from server (BadRequest): container "nvidia-device-plugin" in pod "nvidia-device-plugin-daemonset-mqdxh" is waiting to start: PodInitializing

logs from nvidia-mig-manager-5n7sl

g@gsrv:~$ kubectl logs -n gpu-operator-nvidia nvidia-mig-manager-5n7sl
Defaulted container "nvidia-mig-manager" out of: nvidia-mig-manager, toolkit-validation (init)
Error from server (BadRequest): container "nvidia-mig-manager" in pod "nvidia-mig-manager-5n7sl" is waiting to start: PodInitializing

logs from nvidia-operator-validator-ztgbb

g@gsrv:~$ kubectl logs -n gpu-operator-nvidia nvidia-operator-validator-ztgbb
Defaulted container "nvidia-operator-validator" out of: nvidia-operator-validator, driver-validation (init), toolkit-validation (init), cuda-validation (init), plugin-validation (init)
Error from server (BadRequest): container "nvidia-operator-validator" in pod "nvidia-operator-validator-ztgbb" is waiting to start: PodInitializing

Could we be having different experiences because the topologies are different? Or could the “missing” nfd-master.conf be the cause of the problem?

I can refresh the cluster and see if such a file gets created?

Or is there a way to drill into nvidia-operator-validator-ztgbb, or the nvidia-operator-validator semantics in general, to see what stage of operator validation is failing?
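
(The only way I can think of to drill in is via the individual init containers of the validator pod — driver-validation, toolkit-validation, cuda-validation and plugin-validation, as listed in the log output above. A sketch:)

# see which init container is crashing and its last exit code / reason
kubectl describe pod nvidia-operator-validator-ztgbb -n gpu-operator-nvidia

# pull logs from a specific init container, e.g. the driver validation step
kubectl logs nvidia-operator-validator-ztgbb -n gpu-operator-nvidia -c driver-validation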

On the dgx (GPU node) the /run/nvidia directory is empty:

g@dgx:~$ tree /run/nvidia/
/run/nvidia/
├── driver
└── validations

2 directories, 0 files
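
(As far as I understand, /run/nvidia/validations only gets populated once the validator init containers succeed, so the empty directory is probably a symptom rather than a cause. Since I installed with driver.enabled=false and toolkit.enabled=false, the preinstalled driver and toolkit are what the validation steps should find, so the least I can do is confirm they look sane on the node — a rough sketch; the containerd path is an assumption:)

# on the dgx node: confirm the preinstalled driver responds
nvidia-smi

# confirm the container runtime is configured with the nvidia runtime
# (assumes containerd; adjust the path if using a different runtime)
sudo grep -i nvidia /etc/containerd/config.toml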

I tried deleting the /etc/kubernetes/node-feature-discovery folder and reinstalling the gpu-operator, but it still failed to fix the issue (log attached below). Maybe the operator validator issue is not connected to that!
deleting-node-feature-discovery-folder.txt (17.7 KB)

Logs for the state of the cluster after the last attempt (23/08/2023):
logs_from_final_attempt_23_08_23.txt (44.0 KB)