Hello. The pods nvidia-smi-admin-ops01 and gpu-operator-1680082681-node-feature-discovery-master-8dc9pt4qz went into Error status when an ephemeral-storage eviction occurred.
However, after I cleaned up the disk and brought usage down to 49% (as shown in the picture below), the two pods are still in Error status.
What should I do to fix the two pods that are still in Error status?
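From what I have read, evicted pods are not restarted in place even after the disk pressure clears, so my guess is that I need to delete them and let any controller recreate them. A minimal sketch of what I am considering (the pod names are taken from the descriptions below), though I am not sure this is the right approach:

    # list failed/evicted pods across all namespaces
    kubectl get pods --all-namespaces | grep -E 'Error|Evicted'
    # delete the two evicted pods
    kubectl delete pod nvidia-smi-admin-ops01 -n default
    kubectl delete pod gpu-operator-1680082681-node-feature-discovery-master-8dc9pt4qz -n gpu-operator

Is deleting them the right way to go, or can they be recovered in place?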
The error pods
The description of the pod named nvidia-smi-admin-ops01:
Name:             nvidia-smi-admin-ops01
Namespace:        default
Priority:         0
Node:             admin-ops01/192.168.101.8
Start Time:       Wed, 29 Mar 2023 08:24:37 +0000
Labels:           run=nvidia-smi-admin-ops01
Annotations:      cni.projectcalico.org/containerID: 1c5d35d135cda4f258b7a5fcf10976e8a3cc2a59cf4981882681c3aac3db526e
                  cni.projectcalico.org/podIP:
                  cni.projectcalico.org/podIPs:
Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: ephemeral-storage. Container nvidia-smi-admin-ops01 was using 104Ki, which exceeds its request of 0.
IP:               192.168.33.84
IPs:
  IP:  192.168.33.84
Containers:
  nvidia-smi-admin-ops01:
    Container ID:  containerd://996a7b896d91b2c55cf541d799bfc5f4b4656bd107498d99c23feec7d47bf086
    Image:         nvidia/cuda:11.0.3-base
    Image ID:      docker.io/nvidia/cuda@sha256:7258839ddbf814d0d6da6c730293bd4ba7b8d1455da84948bb7e4f10111a8b91
    Port:          <none>
    Host Port:     <none>
    Args:
      sleep
      infinity
    State:          Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Wed, 29 Mar 2023 09:56:10 +0000
      Finished:     Mon, 03 Apr 2023 05:59:38 +0000
    Ready:          False
    Restart Count:  4
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pqf8d (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-pqf8d:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/hostname=admin-ops01
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>
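Note that this pod requests no resources, which is why the eviction message above says its usage "exceeds its request of 0"; as far as I understand, the kubelet evicts pods whose ephemeral-storage usage most exceeds their request first. If I have to recreate it anyway, a hypothetical manifest with an explicit ephemeral-storage request (the 1Mi/100Mi values are placeholders I picked, not values from my cluster) might look like:

    apiVersion: v1
    kind: Pod
    metadata:
      name: nvidia-smi-admin-ops01
      labels:
        run: nvidia-smi-admin-ops01
    spec:
      nodeSelector:
        kubernetes.io/hostname: admin-ops01
      containers:
      - name: nvidia-smi-admin-ops01
        image: nvidia/cuda:11.0.3-base
        args: ["sleep", "infinity"]
        resources:
          requests:
            ephemeral-storage: "1Mi"    # placeholder request, tune to actual usage
          limits:
            ephemeral-storage: "100Mi"  # placeholder limit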
The description of the pod named gpu-operator-1680082681-node-feature-discovery-master-8dc9pt4qz:
Name:             gpu-operator-1680082681-node-feature-discovery-master-8dc9pt4qz
Namespace:        gpu-operator
Priority:         0
Node:             admin-ops01/192.168.101.8
Start Time:       Wed, 29 Mar 2023 09:38:03 +0000
Labels:           app.kubernetes.io/instance=gpu-operator-1680082681
                  app.kubernetes.io/name=node-feature-discovery
                  pod-template-hash=8dc97d954
                  role=master
Annotations:      cni.projectcalico.org/containerID: ab97d5111239f0a75c3a6d3441abf56c6afddfd8a16f0dc2c476de204f67cf3a
                  cni.projectcalico.org/podIP:
                  cni.projectcalico.org/podIPs:
Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: ephemeral-storage. Container master was using 972Ki, which exceeds its request of 0.
IP:               192.168.33.70
IPs:
  IP:           192.168.33.70
Controlled By:  ReplicaSet/gpu-operator-1680082681-node-feature-discovery-master-8dc97d954
Containers:
  master:
    Container ID:  containerd://679e950f4c695821b2af88f6924c086c406c55adb16e65faf17cec70f7167181
    Image:         k8s.gcr.io/nfd/node-feature-discovery:v0.10.1
    Image ID:      k8s.gcr.io/nfd/node-feature-discovery@sha256:4aebf17c8b72ee91cb468a6f21dd9f0312c1fcfdf8c86341f7aee0ec2d5991d7
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      nfd-master
    Args:
      --extra-label-ns=nvidia.com
      -featurerules-controller=true
    State:          Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Wed, 29 Mar 2023 09:53:56 +0000
      Finished:     Mon, 03 Apr 2023 05:58:46 +0000
    Ready:          False
    Restart Count:  1
    Liveness:       exec [/usr/bin/grpc_health_probe -addr=:8080] delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:      exec [/usr/bin/grpc_health_probe -addr=:8080] delay=5s timeout=1s period=10s #success=1 #failure=10
    Environment:
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7lbk7 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-7lbk7:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule
                             node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>
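One difference I notice between the two pods: the nfd-master pod shows "Controlled By: ReplicaSet/...", so deleting it should make the ReplicaSet create a fresh replica automatically. nvidia-smi-admin-ops01 shows no controller, and its run= label suggests it was created with kubectl run, so after deleting it I would presumably have to recreate it by hand. This is only my assumption about how it was originally created, and the sketch below omits the kubernetes.io/hostname node selector shown above (that would need --overrides or a full manifest):

    # recreate the standalone pod with the same image and args
    kubectl run nvidia-smi-admin-ops01 --image=nvidia/cuda:11.0.3-base -- sleep infinity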