Kubernetes Cluster Misbehaving After Power Outage

We are currently facing an issue with our Kubernetes cluster following a recent power outage and would appreciate your assistance in restoring it.

Cluster Configuration:

  • 1 x BCM Head Node
  • 1 x K8s/Slurm Master Node (separate from head node)
  • 3 x DGX H100 Worker Nodes

Issue Overview:
After the power outage, the entire cluster went down. We followed a sequential restart process: first powering on the head and master nodes, then the worker nodes.

However, we encountered the following issues:

  1. Kubernetes Command Failure:
    Running kubectl get nodes returned an error: “Could not get server API group list.”
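That error generally means kubectl cannot reach the API server at all, which is consistent with kubelet and containerd being down on the master. A minimal triage sketch (assuming a kubeadm-style install with containerd's default socket path; adjust paths if your BCM deployment differs):

```shell
# Confirm the runtime and kubelet state on the master node:
systemctl status containerd kubelet

# If containerd is up, check whether the kube-apiserver static pod
# container ever started (crictl ships with containerd installs):
crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a | grep kube-apiserver

# Verify whether anything is listening on the API server port (6443 by default):
ss -tlnp | grep 6443
```

If nothing listens on 6443, kubectl will keep failing regardless of client-side configuration, so the fix has to start with kubelet and the control-plane static pods.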

  2. Service Status Check:
    Upon checking systemctl status for both kubelet and containerd, we found both services inactive.
    kubelet was inactive because its config file under /var/lib/kubelet was missing (details below).

  3. Service Restart Attempt:
    We attempted to bring the services up using the following commands:

      systemctl enable containerd
      systemctl start containerd
      kubeadm init phase kubelet-start
      systemctl daemon-reexec
      systemctl restart kubelet

Result: containerd started successfully, but kubelet started and then exited shortly afterward.
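Before re-running init phases, it is worth checking which kubelet files actually survived the outage and reading the exact exit reason from the journal. A hedged sketch, assuming the kubeadm default paths (adjust if your install places them elsewhere):

```shell
# Inventory the files kubelet needs at startup (kubeadm defaults):
ls -l /var/lib/kubelet/config.yaml \
      /etc/kubernetes/kubelet.conf \
      /etc/kubernetes/bootstrap-kubelet.conf \
      /etc/kubernetes/pki/ca.crt

# The precise failure reason is in kubelet's journal, not in systemctl status:
journalctl -u kubelet --no-pager -n 50
```

Note that `kubeadm init phase kubelet-start` is intended for the control-plane node being initialized; running it on a node that was already joined can overwrite the node's existing kubelet configuration, so the journal output is the safest guide to what is actually missing.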

Kubelet Configuration Error:

  • We discovered that /var/lib/kubelet/config.yaml was missing.

  • After regenerating the configuration, kubelet failed again, this time because it could not load the CA certificate from the expected location.

  • After copying the required certificate to the location specified in the error, the service then threw a bootstrapping error, and we appear to be stuck in a loop. Can you please let me know if I am missing any steps to fix this?
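A bootstrapping loop after hand-copying certificates often means the kubelet client credentials no longer match the cluster CA, or a certificate expired while the cluster was down. A recovery sketch for the control-plane node, assuming a kubeadm-based install with certificates under /etc/kubernetes/pki (the config file path in step 3 is a hypothetical placeholder for wherever your kubeadm config lives):

```shell
# 1. Check whether any cluster certificates expired during the outage:
kubeadm certs check-expiration

# 2. Verify the copied kubelet client cert is actually signed by the cluster CA;
#    a mismatch here produces exactly this kind of bootstrap loop:
openssl verify -CAfile /etc/kubernetes/pki/ca.crt \
    /var/lib/kubelet/pki/kubelet-client-current.pem

# 3. If kubelet.conf is stale or missing, regenerate it from the CA
#    (control-plane node only; workers should instead rejoin via `kubeadm join`):
kubeadm init phase kubeconfig kubelet --config /path/to/kubeadm-config.yaml  # hypothetical path

# 4. Restart kubelet and watch the journal for the next failure, if any:
systemctl restart kubelet
journalctl -u kubelet -f
```

The general principle: regenerate credentials with kubeadm's own phases rather than copying files to whatever path the error names, since each manual copy can introduce the next mismatch in the chain.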