We are currently facing an issue with our Kubernetes cluster following a recent power outage and would appreciate your assistance in restoring it.
Cluster Configuration:
- 1 x BCM Head Node
- 1 x K8s/Slurm Master Node (separate from head node)
- 3 x DGX H100 Worker Nodes
Issue Overview:
After the power outage, the entire cluster went down. We followed a sequential restart process—first powering on the master nodes, then the worker nodes.
However, we encountered the following issues:
- Kubernetes Command Failure:
  Running kubectl get nodes returned an error: “Could not get server API group list.”
- Service Status Check:
  Checking systemctl status for kubelet and containerd showed that both services were inactive.
  Kubelet was inactive because a config file under /var/lib/kubelet was missing.
- Service Restart Attempt:
  We attempted to bring the services up using the following commands:
systemctl enable containerd
systemctl start containerd
kubeadm init phase kubelet-start
systemctl daemon-reexec
systemctl restart kubelet
containerd successfully started.
kubelet started initially but exited shortly afterward.
- Kubelet Configuration Error:
  - We discovered that /var/lib/kubelet/config.yaml was missing.
  - After regenerating the configuration, kubelet failed again, this time because it could not load the CA certificate from the expected location.
  - After copying the required certificate to the location specified in the error, the service threw a bootstrapping error, and we appear to be stuck in a loop.

Can you please let me know if I am missing any steps to fix these issues?
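
In case it helps with diagnosis, below is a minimal sketch of a check that could be run on the master node to report which kubelet-related files are present. Only /var/lib/kubelet/config.yaml comes from the errors described above. The other three paths are kubeadm defaults and are an assumption, since a BCM-managed deployment may keep them elsewhere. The function name check_kubelet_files and its optional root-prefix argument are hypothetical, added only so the check can also be pointed at a copy of the filesystem.

```shell
#!/usr/bin/env bash
# Sketch: report which kubelet-related files exist and are non-empty.
# ASSUMPTION: every path except /var/lib/kubelet/config.yaml is a kubeadm
# default and may differ on a BCM-managed cluster.

check_kubelet_files() {
    # Optional prefix (hypothetical helper argument); empty means the live root.
    local root="${1:-}"
    local path
    for path in \
        /var/lib/kubelet/config.yaml \
        /etc/kubernetes/kubelet.conf \
        /etc/kubernetes/bootstrap-kubelet.conf \
        /etc/kubernetes/pki/ca.crt
    do
        # -s is true only when the file exists and has a size greater than zero.
        if [ -s "${root}${path}" ]; then
            echo "present ${path}"
        else
            echo "missing ${path}"
        fi
    done
}
```

Calling check_kubelet_files with no argument checks the live node. The output, together with the last lines of journalctl -u kubelet, can be attached here if that would be useful.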