Thanks a lot @Morganh!! I highly appreciate the level of support!!
I followed the steps in the file you uploaded, please check below!
Pre checks
listing the chart
g@gsrv:~$ helm ls -n gpu-operator
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
gpu-operator-1692720059 gpu-operator 1 2023-08-22 16:01:04.446737327 +0000 UTC deployed gpu-operator-v23.6.0 v23.6.0
checking pods
g@gsrv:~$ kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-hg6vw 0/1 Init:0/1 0 16h
gpu-operator-1692720059-node-feature-discovery-master-74b78zmhw 1/1 Running 0 16h
gpu-operator-1692720059-node-feature-discovery-worker-7lqbt 1/1 Running 0 16h
gpu-operator-1692720059-node-feature-discovery-worker-rxqvv 1/1 Running 0 16h
gpu-operator-7b8668c994-kccdk 1/1 Running 0 16h
nvidia-dcgm-exporter-w58h4 0/1 Init:0/1 0 16h
nvidia-device-plugin-daemonset-mc9fp 0/1 Init:0/1 0 16h
nvidia-mig-manager-rvf28 0/1 Init:0/1 0 16h
nvidia-operator-validator-vt9r2 0/1 Init:CrashLoopBackOff 193 (69s ago) 16h
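Since nvidia-operator-validator is the one stuck in Init:CrashLoopBackOff, I could also pull more detail on its init containers. Something like this (just a sketch of what I'd run next, using the pod name from the listing above; I haven't pasted the output here):

# Show events and init-container states for the crashing validator pod
kubectl describe pod nvidia-operator-validator-vt9r2 -n gpu-operator

# Logs of its first init container (driver-validation)
kubectl logs nvidia-operator-validator-vt9r2 -n gpu-operator -c driver-validation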
checking clusterroles
g@gsrv:~$ kubectl get clusterroles | grep gpu
gpu-operator 2023-08-22T16:01:07Z
gpu-operator-1692635184-node-feature-discovery 2023-08-21T16:26:30Z (Note: could this be the cause, since we have two of these?)
gpu-operator-1692720059-node-feature-discovery 2023-08-22T16:01:07Z
nvidia-gpu-feature-discovery 2023-08-22T16:01:26Z
checking clusterrolebindings
g@gsrv:~$ kubectl get clusterrolebinding | grep gpu
gpu-operator ClusterRole/gpu-operator 16h
gpu-operator-1692720059-node-feature-discovery ClusterRole/gpu-operator-1692720059-node-feature-discovery 16h
gpu-operator-1692720059-node-feature-discovery-topology-updater ClusterRole/gpu-operator-1692720059-node-feature-discovery-topology-updater 16h
nvidia-gpu-feature-discovery ClusterRole/nvidia-gpu-feature-discovery 16h
checking deployments, daemonsets and crds
g@gsrv:~$ kubectl get deployments -A
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
calico-apiserver calico-apiserver 2/2 2 2 13d
calico-system calico-kube-controllers 1/1 1 1 13d
calico-system calico-typha 1/1 1 1 13d
clearml clearml-apiserver 1/1 1 1 12d
clearml clearml-fileserver 1/1 1 1 12d
clearml clearml-mongodb 1/1 1 1 12d
clearml clearml-webserver 1/1 1 1 12d
gpu-operator gpu-operator 1/1 1 1 16h
gpu-operator gpu-operator-1692720059-node-feature-discovery-master 1/1 1 1 16h
k8-storage nfs-subdir-external-provisioner 1/1 1 1 12d
kube-system coredns 2/2 2 2 13d
nuclio nuclio-controller 1/1 1 1 11d
nuclio nuclio-dashboard 1/1 1 1 11d
nuclio nuclio-test-nuctl-function-1 1/1 1 1 5d18h (Note: This is still working because it only needs CPU)
nuclio nuclio-test-nuctl-function-2-retinanet 0/1 1 0 4d23h (Note: This is not working because it needs a GPU)
tao-gnet ingress-nginx-controller 1/1 1 1 41h
tao-gnet tao-toolkit-api-app-pod 1/1 1 1 41h
tao-gnet tao-toolkit-api-workflow-pod 1/1 1 1 41h
tigera-operator tigera-operator 1/1 1 1 13d
g@gsrv:~$ kubectl get daemonsets -A
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
calico-system calico-node 2 2 2 2 2 kubernetes.io/os=linux 13d
calico-system csi-node-driver 2 2 2 2 2 kubernetes.io/os=linux 13d
gpu-operator gpu-feature-discovery 1 1 0 1 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 16h
gpu-operator gpu-operator-1692720059-node-feature-discovery-worker 2 2 2 2 2 <none> 16h
gpu-operator nvidia-dcgm-exporter 1 1 0 1 0 nvidia.com/gpu.deploy.dcgm-exporter=true 16h
gpu-operator nvidia-device-plugin-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.device-plugin=true 16h
gpu-operator nvidia-mig-manager 1 1 0 1 0 nvidia.com/gpu.deploy.mig-manager=true 16h
gpu-operator nvidia-operator-validator 1 1 0 1 0 nvidia.com/gpu.deploy.operator-validator=true 16h
kube-system kube-proxy 2 2 2 2 2 kubernetes.io/os=linux 13d
g@gsrv:~$ kubectl get crd
NAME CREATED AT
apiservers.operator.tigera.io 2023-08-09T17:40:28Z
bgpconfigurations.crd.projectcalico.org 2023-08-09T17:40:27Z
bgpfilters.crd.projectcalico.org 2023-08-09T17:40:28Z
bgppeers.crd.projectcalico.org 2023-08-09T17:40:28Z
blockaffinities.crd.projectcalico.org 2023-08-09T17:40:28Z
caliconodestatuses.crd.projectcalico.org 2023-08-09T17:40:28Z
clusterinformations.crd.projectcalico.org 2023-08-09T17:40:28Z
clusterpolicies.nvidia.com 2023-08-22T16:01:02Z
felixconfigurations.crd.projectcalico.org 2023-08-09T17:40:28Z
globalnetworkpolicies.crd.projectcalico.org 2023-08-09T17:40:28Z
globalnetworksets.crd.projectcalico.org 2023-08-09T17:40:28Z
hostendpoints.crd.projectcalico.org 2023-08-09T17:40:28Z
imagesets.operator.tigera.io 2023-08-09T17:40:28Z
installations.operator.tigera.io 2023-08-09T17:40:28Z
ipamblocks.crd.projectcalico.org 2023-08-09T17:40:28Z
ipamconfigs.crd.projectcalico.org 2023-08-09T17:40:28Z
ipamhandles.crd.projectcalico.org 2023-08-09T17:40:28Z
ippools.crd.projectcalico.org 2023-08-09T17:40:28Z
ipreservations.crd.projectcalico.org 2023-08-09T17:40:28Z
kubecontrollersconfigurations.crd.projectcalico.org 2023-08-09T17:40:28Z
networkpolicies.crd.projectcalico.org 2023-08-09T17:40:28Z
networksets.crd.projectcalico.org 2023-08-09T17:40:28Z
nodefeaturerules.nfd.k8s-sigs.io 2023-08-22T09:50:50Z
nodefeatures.nfd.k8s-sigs.io 2023-08-09T17:53:28Z
nuclioapigateways.nuclio.io 2023-08-11T11:08:55Z
nucliofunctionevents.nuclio.io 2023-08-11T11:08:55Z
nucliofunctions.nuclio.io 2023-08-11T11:08:55Z
nuclioprojects.nuclio.io 2023-08-11T11:08:55Z
tigerastatuses.operator.tigera.io 2023-08-09T17:40:29Z
Deleting
finding the generated chart name
g@gsrv:~$ helm list -n gpu-operator | grep gpu | awk '{print $1}'
gpu-operator-1692720059
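(As an aside, the lookup and delete could probably be combined into one line, as long as the namespace is passed; a sketch, not something I actually ran:)

# Find the generated release name and uninstall it in one go (note the -n flag)
helm delete "$(helm list -n gpu-operator -q | grep gpu-operator)" -n gpu-operator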
deleting the chart
g@gsrv:~$ helm delete gpu-operator-1692720059
Error: uninstall: Release not loaded: gpu-operator-1692720059: release: not found
(That first attempt failed, presumably because I left out the -n gpu-operator flag.)
Because I also saw artifacts from gpu-operator-1692635184,
I tried deleting that old release as well:
helm delete gpu-operator-1692635184 -n gpu-operator
Error: uninstall: Release not loaded: gpu-operator-1692635184: release: not found
deleting the stray clusterrole I found earlier to make sure it is all gone.
g@gsrv:~$ kubectl delete clusterroles gpu-operator-1692635184-node-feature-discovery
clusterrole.rbac.authorization.k8s.io "gpu-operator-1692635184-node-feature-discovery" deleted
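To be thorough, I could also sweep for any other cluster-scoped leftovers from the old release (sketch; 1692635184 is the old release suffix):

# Look for any remaining cluster-scoped objects from the old release
kubectl get clusterroles,clusterrolebindings | grep 1692635184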
uninstalling (but using the delete command as you’ve done)
helm delete gpu-operator-1692720059 -n gpu-operator
release "gpu-operator-1692720059" uninstalled
deleting the crd
g@gsrv:~$ kubectl delete crd clusterpolicies.nvidia.com
customresourcedefinition.apiextensions.k8s.io "clusterpolicies.nvidia.com" deleted
a check
g@gsrv:~$ helm uninstall --wait gpu-operator-1692635184 -n gpu-operator
Error: uninstall: Release not loaded: gpu-operator-1692635184: release: not found
g@gsrv:~$ helm uninstall --wait gpu-operator-1692720059 -n gpu-operator
Error: uninstall: Release not loaded: gpu-operator-1692720059: release: not found
Then I even deleted the namespace just to be sure it’s all gone
g@gsrv:~$ kubectl delete namespace gpu-operator
namespace "gpu-operator" deleted
Checking it is all gone.
g@gsrv:~$ helm ls -n gpu-operator
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
g@gsrv:~$ kubectl get clusterroles | grep gpu
g@gsrv:~$ kubectl get clusterrolebinding | grep gpu
g@gsrv:~$ kubectl get clusterrolebinding | grep nv
g@gsrv:~$ kubectl get clusterroles | grep nv
g@gsrv:~$ kubectl get crd
NAME CREATED AT
apiservers.operator.tigera.io 2023-08-09T17:40:28Z
bgpconfigurations.crd.projectcalico.org 2023-08-09T17:40:27Z
bgpfilters.crd.projectcalico.org 2023-08-09T17:40:28Z
bgppeers.crd.projectcalico.org 2023-08-09T17:40:28Z
blockaffinities.crd.projectcalico.org 2023-08-09T17:40:28Z
caliconodestatuses.crd.projectcalico.org 2023-08-09T17:40:28Z
clusterinformations.crd.projectcalico.org 2023-08-09T17:40:28Z
felixconfigurations.crd.projectcalico.org 2023-08-09T17:40:28Z
globalnetworkpolicies.crd.projectcalico.org 2023-08-09T17:40:28Z
globalnetworksets.crd.projectcalico.org 2023-08-09T17:40:28Z
hostendpoints.crd.projectcalico.org 2023-08-09T17:40:28Z
imagesets.operator.tigera.io 2023-08-09T17:40:28Z
installations.operator.tigera.io 2023-08-09T17:40:28Z
ipamblocks.crd.projectcalico.org 2023-08-09T17:40:28Z
ipamconfigs.crd.projectcalico.org 2023-08-09T17:40:28Z
ipamhandles.crd.projectcalico.org 2023-08-09T17:40:28Z
ippools.crd.projectcalico.org 2023-08-09T17:40:28Z
ipreservations.crd.projectcalico.org 2023-08-09T17:40:28Z
kubecontrollersconfigurations.crd.projectcalico.org 2023-08-09T17:40:28Z
networkpolicies.crd.projectcalico.org 2023-08-09T17:40:28Z
networksets.crd.projectcalico.org 2023-08-09T17:40:28Z
nodefeaturerules.nfd.k8s-sigs.io 2023-08-22T09:50:50Z
nodefeatures.nfd.k8s-sigs.io 2023-08-09T17:53:28Z
nuclioapigateways.nuclio.io 2023-08-11T11:08:55Z
nucliofunctionevents.nuclio.io 2023-08-11T11:08:55Z
nucliofunctions.nuclio.io 2023-08-11T11:08:55Z
nuclioprojects.nuclio.io 2023-08-11T11:08:55Z
tigerastatuses.operator.tigera.io 2023-08-09T17:40:29Z
Then I shut down the DGX (to unload the drivers, in case they were lingering about).
Reinstalling the chart
- Turned the DGX back on
- Waited ~15 minutes for the cluster to settle (in case it needed some time to renegotiate)
Made sure everything was up and running:
g@dgx:~$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
calico-apiserver calico-apiserver-db54b987d-m66zz 1/1 Running 0 13d
calico-apiserver calico-apiserver-db54b987d-ncspt 1/1 Running 0 13d
calico-system calico-kube-controllers-666f5dcd4d-kj7fs 1/1 Running 0 13d
calico-system calico-node-j2ljx 1/1 Running 6 (27m ago) 13d
calico-system calico-node-trx99 1/1 Running 0 13d
calico-system calico-typha-585d9c9df4-x9c6k 1/1 Running 0 13d
calico-system csi-node-driver-slh5f 2/2 Running 0 13d
calico-system csi-node-driver-wf8n9 2/2 Running 12 (27m ago) 13d
clearml clearml-apiserver-76ff97d7f7-wcn6v 1/1 Running 0 21m
clearml clearml-elastic-master-0 1/1 Running 0 15m
clearml clearml-fileserver-ff756c4b8-fk59x 1/1 Running 0 21m
clearml clearml-mongodb-5f9468969b-bmc6s 1/1 Running 0 21m
clearml clearml-redis-master-0 1/1 Running 0 15m
clearml clearml-webserver-7f5fb5df5d-qpkbl 1/1 Running 0 21m
default cuda-vectoradd 0/1 Pending 0 21h
default gpu-test-job-v2zxg 0/1 Pending 0 42h
k8-storage nfs-subdir-external-provisioner-5669cc5b6-77gz5 1/1 Running 1 (15m ago) 21m
kube-system coredns-57575c5f89-9flb2 1/1 Running 0 13d
kube-system coredns-57575c5f89-nrd5f 1/1 Running 0 13d
kube-system etcd-gsrv 1/1 Running 0 13d
kube-system kube-apiserver-gsrv 1/1 Running 0 13d
kube-system kube-controller-manager-gsrv 1/1 Running 0 13d
kube-system kube-proxy-tzhrp 1/1 Running 6 (27m ago) 13d
kube-system kube-proxy-z4hxr 1/1 Running 0 13d
kube-system kube-scheduler-gsrv 1/1 Running 0 13d
nuclio nuclio-controller-679c44dcdc-nsmtm 1/1 Running 0 21m
nuclio nuclio-dashboard-6496cdfd66-ktnkb 1/1 Running 1 (15m ago) 21m
nuclio nuclio-test-nuctl-function-1-84b6bd65bd-wz9hm 1/1 Running 0 21m
nuclio nuclio-test-nuctl-function-2-retinanet-7d8545d7db-fv87n 0/1 Pending 0 41h
tao-gnet ingress-nginx-controller-78d54fbd-g9nrm 1/1 Running 0 21m
tao-gnet tao-toolkit-api-app-pod-5ffc48cd57-7d4mx 1/1 Running 0 21m
tao-gnet tao-toolkit-api-workflow-pod-6dbc7c8f98-5dl8n 1/1 Running 0 21m
tigera-operator tigera-operator-959786749-ctprw 1/1 Running 0 13d
Then I reinstalled the chart! (This time I used a new namespace, gpu-operator-nvidia, instead of the previous gpu-operator, to be extra safe.)
helm install --wait --generate-name \
-n gpu-operator-nvidia --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=false \
--set toolkit.enabled=false
Still no joy!! :(
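Another thing I could check after the install is the overall state the operator reports on its ClusterPolicy (a sketch; I believe recent gpu-operator versions expose a status.state field on the ClusterPolicy, but I'm not 100% sure of the exact path):

# Overall state reported by the GPU operator's ClusterPolicy
kubectl get clusterpolicies.nvidia.com -o wide
kubectl get clusterpolicies.nvidia.com -o jsonpath='{.items[0].status.state}{"\n"}'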
Then I looked at the logs:
- finding all the pods in the namespace
g@gsrv:~$ kubectl get pods -n gpu-operator-nvidia
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-c24v4 0/1 Init:0/1 0 18m
gpu-operator-1692786693-node-feature-discovery-master-f947hcgkb 1/1 Running 0 18m
gpu-operator-1692786693-node-feature-discovery-worker-flhl8 1/1 Running 0 18m
gpu-operator-1692786693-node-feature-discovery-worker-thvxj 1/1 Running 0 18m
gpu-operator-5747c5f6db-2cvht 1/1 Running 0 18m
nvidia-dcgm-exporter-zcxbb 0/1 Init:0/1 0 18m
nvidia-device-plugin-daemonset-mqdxh 0/1 Init:0/1 0 18m
nvidia-mig-manager-5n7sl 0/1 Init:0/1 0 18m
nvidia-operator-validator-ztgbb 0/1 Init:CrashLoopBackOff 8 (2m51s ago) 18m
Then I got logs for all pods, starting with the running ones.
logs from gpu-operator-1692786693-node-feature-discovery-master-f947hcgkb
g@gsrv:~$ kubectl logs -n gpu-operator-nvidia gpu-operator-1692786693-node-feature-discovery-master-f947hcgkb
W0823 10:31:41.346829 1 main.go:56] -featurerules-controller is deprecated, use '-crd-controller' flag instead
I0823 10:31:41.347026 1 nfd-master.go:181] Node Feature Discovery Master v0.13.1
I0823 10:31:41.347038 1 nfd-master.go:185] NodeName: "gsrv"
I0823 10:31:41.347048 1 nfd-master.go:186] Kubernetes namespace: "gpu-operator-nvidia"
I0823 10:31:41.347112 1 nfd-master.go:1091] config file "/etc/kubernetes/node-feature-discovery/nfd-master.conf" not found, using defaults
I0823 10:31:41.347303 1 nfd-master.go:1145] master (re-)configuration successfully completed
I0823 10:31:41.347319 1 nfd-master.go:202] starting nfd api controller
I0823 10:31:41.376543 1 component.go:36] [core][Server #1] Server created
I0823 10:31:41.376576 1 nfd-master.go:292] gRPC server serving on port: 8080
I0823 10:31:41.376641 1 component.go:36] [core][Server #1 ListenSocket #2] ListenSocket created
I0823 10:31:42.375973 1 nfd-master.go:601] will process all nodes in the cluster
I'm not sure whether
config file "/etc/kubernetes/node-feature-discovery/nfd-master.conf" not found, using defaults
is a problem, as mentioned in the quote below from topic 226781?
I checked the /etc/kubernetes/node-feature-discovery
directories on the two nodes.
gsrv (master node)
g@gsrv:~$ tree /etc/kubernetes/node-feature-discovery
/etc/kubernetes/node-feature-discovery
├── features.d
└── source.d
2 directories, 0 files
dgx (gpu node)
g@dgx:~$ tree /etc/kubernetes/node-feature-discovery
/etc/kubernetes/node-feature-discovery
├── features.d
└── source.d
2 directories, 0 files
Maybe the nfd-master.conf
got deleted, or due to some config error it is looking for a file that never existed?
However, when I check for configmaps, there does seem to be one created with the generated name.
g@gsrv:~$ kubectl get configmap -n gpu-operator-nvidia
NAME DATA AGE
default-gpu-clients 1 70m
default-mig-parted-config 1 70m
gpu-operator-1692786693-node-feature-discovery-master-conf 1 70m
gpu-operator-1692786693-node-feature-discovery-topology-updater-conf 1 70m
gpu-operator-1692786693-node-feature-discovery-worker-conf 1 70m
kube-root-ca.crt 1 70m
nvidia-device-plugin-entrypoint 1 70m
nvidia-mig-manager-entrypoint 1 70m
When I run kubectl edit configmap -n gpu-operator-nvidia gpu-operator-1692786693-node-feature-discovery-master-conf
it opens up for editing, so maybe this is not an issue?
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
data:
  nfd-master.conf: |-
    extraLabelNs:
    - nvidia.com
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: gpu-operator-1692786693
    meta.helm.sh/release-namespace: gpu-operator-nvidia
  creationTimestamp: "2023-08-23T10:31:39Z"
  labels:
    app.kubernetes.io/instance: gpu-operator-1692786693
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/version: v0.13.1
    helm.sh/chart: node-feature-discovery-0.13.1
  name: gpu-operator-1692786693-node-feature-discovery-master-conf
  namespace: gpu-operator-nvidia
  resourceVersion: "3422257"
  uid: 37c2b25f-3b04-4028-a8be-f95a353544fc
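So the conf does exist as a ConfigMap. To check whether it actually gets mounted into the master pod at the path the log complains about, I could exec into it (sketch, using the pod name from above):

# Check whether nfd-master.conf is visible inside the nfd-master pod
kubectl exec -n gpu-operator-nvidia gpu-operator-1692786693-node-feature-discovery-master-f947hcgkb -- ls -l /etc/kubernetes/node-feature-discovery/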
logs from gpu-operator-1692786693-node-feature-discovery-worker-flhl8
g@gsrv:~$ kubectl logs -n gpu-operator-nvidia gpu-operator-1692786693-node-feature-discovery-worker-flhl8
I0823 10:31:41.387093 1 nfd-worker.go:222] Node Feature Discovery Worker v0.13.1
I0823 10:31:41.387150 1 nfd-worker.go:223] NodeName: 'gsrv'
I0823 10:31:41.387158 1 nfd-worker.go:224] Kubernetes namespace: 'gpu-operator-nvidia'
I0823 10:31:41.387779 1 nfd-worker.go:518] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I0823 10:31:41.387911 1 nfd-worker.go:550] worker (re-)configuration successfully completed
I0823 10:31:41.411725 1 local.go:115] starting hooks...
I0823 10:31:41.445933 1 nfd-worker.go:561] starting feature discovery...
I0823 10:31:41.446421 1 nfd-worker.go:573] feature discovery completed
I0823 10:31:41.466469 1 nfd-worker.go:694] creating NodeFeature object "gsrv"
I0823 10:32:41.423353 1 local.go:115] starting hooks...
I0823 10:32:41.484458 1 nfd-worker.go:561] starting feature discovery...
I0823 10:32:41.485156 1 nfd-worker.go:573] feature discovery completed
...
...
I0823 10:49:41.430477 1 local.go:115] starting hooks...
I0823 10:49:41.479448 1 nfd-worker.go:561] starting feature discovery...
I0823 10:49:41.480000 1 nfd-worker.go:573] feature discovery completed
...
...
logs from gpu-operator-1692786693-node-feature-discovery-worker-thvxj
g@gsrv:~$ kubectl logs -n gpu-operator-nvidia gpu-operator-1692786693-node-feature-discovery-worker-thvxj
I0823 10:31:40.964623 1 nfd-worker.go:222] Node Feature Discovery Worker v0.13.1
I0823 10:31:40.964651 1 nfd-worker.go:223] NodeName: 'dgx'
I0823 10:31:40.964655 1 nfd-worker.go:224] Kubernetes namespace: 'gpu-operator-nvidia'
I0823 10:31:40.966036 1 nfd-worker.go:518] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I0823 10:31:40.966088 1 nfd-worker.go:550] worker (re-)configuration successfully completed
I0823 10:31:40.975638 1 local.go:115] starting hooks...
I0823 10:31:40.993071 1 nfd-worker.go:561] starting feature discovery...
I0823 10:31:40.993439 1 nfd-worker.go:573] feature discovery completed
I0823 10:31:41.010433 1 nfd-worker.go:694] creating NodeFeature object "dgx"
I0823 10:32:41.000681 1 local.go:115] starting hooks...
I0823 10:32:41.017481 1 nfd-worker.go:561] starting feature discovery...
I0823 10:32:41.017801 1 nfd-worker.go:573] feature discovery completed
...
...
I0823 10:40:41.002458 1 local.go:115] starting hooks...
I0823 10:40:41.016863 1 nfd-worker.go:561] starting feature discovery...
I0823 10:40:41.017193 1 nfd-worker.go:573] feature discovery completed
...
...
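The worker on dgx says feature discovery completed, so to confirm it actually labeled the GPU node I could check the node labels (sketch):

# Check which nvidia.com labels ended up on the GPU node
kubectl describe node dgx | grep -i nvidia.com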
logs from gpu-operator-5747c5f6db-2cvht
gpu-operator-5747c5f6db-2cvht.txt (4.7 MB)
logs from nvidia-dcgm-exporter-zcxbb
g@gsrv:~$ kubectl logs -n gpu-operator-nvidia nvidia-dcgm-exporter-zcxbb
Defaulted container "nvidia-dcgm-exporter" out of: nvidia-dcgm-exporter, toolkit-validation (init)
Error from server (BadRequest): container "nvidia-dcgm-exporter" in pod "nvidia-dcgm-exporter-zcxbb" is waiting to start: PodInitializing
logs from nvidia-device-plugin-daemonset-mqdxh
g@gsrv:~$ kubectl logs -n gpu-operator-nvidia nvidia-device-plugin-daemonset-mqdxh
Defaulted container "nvidia-device-plugin" out of: nvidia-device-plugin, toolkit-validation (init)
Error from server (BadRequest): container "nvidia-device-plugin" in pod "nvidia-device-plugin-daemonset-mqdxh" is waiting to start: PodInitializing
logs from nvidia-mig-manager-5n7sl
g@gsrv:~$ kubectl logs -n gpu-operator-nvidia nvidia-mig-manager-5n7sl
Defaulted container "nvidia-mig-manager" out of: nvidia-mig-manager, toolkit-validation (init)
Error from server (BadRequest): container "nvidia-mig-manager" in pod "nvidia-mig-manager-5n7sl" is waiting to start: PodInitializing
logs from nvidia-operator-validator-ztgbb
g@gsrv:~$ kubectl logs -n gpu-operator-nvidia nvidia-operator-validator-ztgbb
Defaulted container "nvidia-operator-validator" out of: nvidia-operator-validator, driver-validation (init), toolkit-validation (init), cuda-validation (init), plugin-validation (init)
Error from server (BadRequest): container "nvidia-operator-validator" in pod "nvidia-operator-validator-ztgbb" is waiting to start: PodInitializing
Could we be having different experiences because the topologies are different? Or could the "missing" nfd-master.conf
be the cause of the problem?
I can refresh the cluster and see whether such a file gets created?
Or is there a way to drill into nvidia-operator-validator-ztgbb
or the nvidia operator validator semantics, to see which stage of operator validation is failing?
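For example, would something like this show which validation stage is stuck? (sketch)

# List the init containers of the validator pod and their current state
kubectl get pod nvidia-operator-validator-ztgbb -n gpu-operator-nvidia -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'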
On the dgx (GPU node), the /run/nvidia
directory is empty:
g@dgx:~$ tree /run/nvidia/
/run/nvidia/
├── driver
└── validations
2 directories, 0 files
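Since I installed with driver.enabled=false and toolkit.enabled=false, I believe the operator expects the driver and container toolkit to already be working on the host, so maybe it's worth double-checking that on the dgx node (sketch; the containerd config path is my assumption):

# On the dgx node: confirm the host driver is loaded
nvidia-smi

# And that the NVIDIA container runtime is wired into containerd
grep -i nvidia /etc/containerd/config.toml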
I tried deleting the /etc/kubernetes/node-feature-discovery
folder and reinstalling gpu-operator, but that still failed to fix the issue (log attached below). Maybe the operator validator issue is not connected to that!
deleting-node-feature-discovery-folder.txt (17.7 KB)
Logs for the state of the cluster after the last attempt (23/08/2023):
logs_from_final_attempt_23_08_23.txt (44.0 KB)