Need help with "Run Reference Pipelines using Local Input Files"

Hi, I have a question and hope someone can help me.

I am running reference pipelines using local input files, following
https://docs.nvidia.com/clara/deploy_archive/R6_2/RunningReferencePipeline.html

After
$clara create pipelines -p chestxray-pipeline.yaml
$clara create jobs -n xray-test-2 -p -f input/png/

I created the pipeline and job for "clara_ai_chestxray_pipeline" successfully.

$clara start job -j

but the pods are evicted after starting the job:

$ kubectl get pods
NAME READY STATUS RESTARTS AGE
clara-clara-platformapiserver-55c46b8448-f8jz5 1/1 Running 0 85m
clara-dicom-adapter-96948fff7-5ppzp 0/1 Evicted 0 42m
clara-dicom-adapter-96948fff7-8pvzk 0/1 Evicted 0 43m
clara-dicom-adapter-96948fff7-cnh64 0/1 Evicted 0 42m
clara-dicom-adapter-96948fff7-db8f6 0/1 Evicted 0 43m
clara-dicom-adapter-96948fff7-fnnhx 0/1 Evicted 0 43m
clara-dicom-adapter-96948fff7-h6zdx 1/1 Running 0 42m
clara-dicom-adapter-96948fff7-jvkhl 0/1 Evicted 0 43m
clara-dicom-adapter-96948fff7-kz29h 0/1 Evicted 0 86m
clara-dicom-adapter-96948fff7-plxnw 0/1 Evicted 0 43m
clara-dicom-adapter-96948fff7-pwbmp 0/1 Evicted 0 43m
clara-dicom-adapter-96948fff7-ql24b 0/1 Evicted 0 43m
clara-monitor-server-fluentd-elasticsearch-6q5lc 1/1 Running 553 4d23h
clara-monitor-server-grafana-5f874b974d-2d9fd 1/1 Running 3 4d23h
clara-monitor-server-monitor-server-6955f85dbf-fq2hv 0/1 CrashLoopBackOff 1414 4d23h
clara-node-monitor-z8sxr 0/1 CrashLoopBackOff 2248 7d22h
clara-pipesvc-b40216de-trtis-65d9c5d5f4-tb87s 0/1 CrashLoopBackOff 13 43m
clara-render-server-clara-renderer-c9ccfc47c-fntz7 2/3 CrashLoopBackOff 11 41m
clara-render-server-clara-renderer-c9ccfc47c-vdqn8 0/3 Evicted 0 85m
clara-resultsservice-5d98fdd785-8km4x 1/1 Running 3 5h36m
clara-ui-6f89b97df8-nj2cf 1/1 Running 3 7d22h
clara-workflow-controller-69cbb55fc8-44lt9 1/1 Running 3 7d22h
elasticsearch-master-0 0/1 Running 0 42m
elasticsearch-master-1 0/1 Running 0 42m
fluentd-6s4cc 1/1 Running 2 4h45m
xray-test-2-l7rpz-3117232601 0/2 Evicted 0 43m

I am also unable to connect to localhost:8000 or download the output files.
$clara download :/operators/ai-app-chestxray/*
Downloading…
Downloading 0 files
No matching file found

Does anybody know how to fix this problem? Thank you in advance.

Hi gxiangc,

Thanks for your interest in the Clara Deploy platform. Evicted pods typically point to insufficient node resources. Can you provide the output of the following?
kubectl get pods -A
kubectl describe node
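Also note that Evicted pods are not cleaned up automatically, so they accumulate in `kubectl get pods` output. Assuming your kubectl version supports `--field-selector` on delete (Evicted pods report `status.phase=Failed`), you can clear them with something like:

```shell
# List pods left behind in the Failed phase (includes Evicted pods)
kubectl get pods -A --field-selector=status.phase=Failed

# Delete them across all namespaces so they stop cluttering the output;
# this does not fix the underlying resource pressure, it only tidies up
kubectl delete pods -A --field-selector=status.phase=Failed
```

This only removes the stale pod records; the replacement pods will keep getting evicted until the node-level resource shortage is resolved.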

Thanks,
Kris

Part of the output of "kubectl get pods -A":

NAMESPACE NAME READY STATUS RESTARTS AGE
default clara-render-server-clara-renderer-c9ccfc47c-s2t2m 0/3 Evicted 0 8d
default clara-render-server-clara-renderer-c9ccfc47c-s2wf2 0/3 Evicted 0 7d22h
default clara-render-server-clara-renderer-c9ccfc47c-s456t 0/3 Evicted 0 3d18h


default clara-resultsservice-5d98fdd785-2ms2p 0/1 Evicted 0 47h
default clara-resultsservice-5d98fdd785-2qmzr 0/1 Evicted 0 22h
default clara-resultsservice-5d98fdd785-2zp2f 0/1 Evicted 0 7h39m
default clara-resultsservice-5d98fdd785-45k84 0/1 Evicted 0 6d19h


default clara-ui-6f89b97df8-4qjj5 0/1 Evicted 0 22h
default clara-ui-6f89b97df8-9hbhb 0/1 Evicted 0 23h
default clara-ui-6f89b97df8-gwprn 0/1 Evicted 0 21h
default clara-ui-6f89b97df8-hqgl9 0/1 Evicted 0 20h
default clara-ui-6f89b97df8-knfqt 1/1 Running 0 19h
default clara-ui-6f89b97df8-nj2cf 0/1 Evicted 0 19d
default clara-ui-6f89b97df8-ntdz2 0/1 Evicted 0 22h
default clara-ui-6f89b97df8-ntwwm 0/1 Evicted 0 23h
default clara-ui-6f89b97df8-p99bv 0/1 Evicted 0 22h
default clara-ui-6f89b97df8-wv76g 0/1 Evicted 0 20h
default clara-ui-6f89b97df8-zbkkh 0/1 Evicted 0 21h
default clara-ui-6f89b97df8-zm2pl 0/1 Evicted 0 19h
default clara-workflow-controller-69cbb55fc8-2hd2z 0/1 Evicted 0 22h
default clara-workflow-controller-69cbb55fc8-44lt9 0/1 Evicted 0 19d
default clara-workflow-controller-69cbb55fc8-4w2f6 0/1 Evicted 0 29h
default clara-workflow-controller-69cbb55fc8-72wn4 0/1 Evicted 0 10h
default clara-workflow-controller-69cbb55fc8-79hr2 0/1 Evicted 0 6d12h
default clara-workflow-controller-69cbb55fc8-9f77k 1/1 Running 0 3h52m
default clara-workflow-controller-69cbb55fc8-hrsdl 0/1 Evicted 0 23h
default clara-workflow-controller-69cbb55fc8-kkg8l 0/1 Evicted 0 6d16h
default clara-workflow-controller-69cbb55fc8-mmxfl 0/1 Evicted 0 21h
default clara-workflow-controller-69cbb55fc8-pzpmb 0/1 Evicted 0 21h
default clara-workflow-controller-69cbb55fc8-r2tcs 0/1 Evicted 0 2d1h
default clara-workflow-controller-69cbb55fc8-rssz7 0/1 Evicted 0 23h
default clara-workflow-controller-69cbb55fc8-t76hz 0/1 Evicted 0 22h
default clara-workflow-controller-69cbb55fc8-wms47 0/1 Evicted 0 19h
default clara-workflow-controller-69cbb55fc8-zvvj4 0/1 Evicted 0 22h
default elasticsearch-master-0 0/1 Init:0/4 0 23m
default elasticsearch-master-1 0/1 Init:0/4 0 22m
default fluentd-6nv9m 0/1 Evicted 0 21m
kube-system coredns-bccdc95cf-2n695 1/1 Running 0 19h
kube-system coredns-bccdc95cf-d7sdm 0/1 Evicted 0 21h
kube-system coredns-bccdc95cf-g8wkq 1/1 Running 0 19h
kube-system coredns-bccdc95cf-h7hkk 0/1 Evicted 0 21h
kube-system coredns-bccdc95cf-jrgcg 0/1 Evicted 0 53d
kube-system coredns-bccdc95cf-zrkbj 0/1 Evicted 0 53d
kube-system etcd-user 1/1 Running 5 53d
kube-system kube-apiserver-user 1/1 Running 5 53d
kube-system kube-controller-manager-user 1/1 Running 6 53d
kube-system kube-proxy-hcxkq 1/1 Running 0 19h
kube-system kube-scheduler-user 1/1 Running 5 53d
kube-system tiller-deploy-659c6788f5-pfqdg 0/1 Evicted 0 12d
kube-system tiller-deploy-659c6788f5-tmzfz 0/1 ImagePullBackOff 0 23h

The output of "kubectl describe node":

Name: user
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=user
kubernetes.io/os=linux
node-role.kubernetes.io/master=
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Sat, 12 Dec 2020 10:31:24 +0000
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message


MemoryPressure False Thu, 04 Feb 2021 08:54:14 +0000 Sat, 12 Dec 2020 10:31:21 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 04 Feb 2021 08:54:14 +0000 Thu, 04 Feb 2021 08:34:19 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 04 Feb 2021 08:54:14 +0000 Sat, 12 Dec 2020 10:31:21 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 04 Feb 2021 08:54:14 +0000 Sat, 12 Dec 2020 10:33:14 +0000 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.104.141.169
Hostname: user
Capacity:
cpu: 64
ephemeral-storage: 123330112Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 528012260Ki
pods: 110
Allocatable:
cpu: 64
ephemeral-storage: 113661031032
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 527909860Ki
pods: 110
System Info:
Machine ID: eac4f33ede0049208a503de025e7f501
System UUID: 545D05B4-19F6-03E4-B211-D21D007BD91B
Boot ID: 974a019b-25c1-4ce9-8fe6-8c2089252670
Kernel Version: 4.15.0-132-generic
OS Image: Ubuntu 18.04.5 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://19.3.8
Kubelet Version: v1.15.4
Kube-Proxy Version: v1.15.4
PodCIDR: 10.244.0.0/24
Non-terminated Pods: (21 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE


default clara-clara-platformapiserver-55c46b8448-776p8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28m
default clara-dicom-adapter-96948fff7-pqvpj 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28m
default clara-monitor-server-grafana-5f874b974d-29vjh 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28m
default clara-monitor-server-monitor-server-6955f85dbf-fq2hv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 16d
default clara-node-monitor-z8sxr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 19d
default clara-pipesvc-b40216de-trtis-65d9c5d5f4-tb87s 0 (0%) 0 (0%) 0 (0%) 0 (0%) 12d
default clara-render-server-clara-renderer-c9ccfc47c-lhcgg 0 (0%) 0 (0%) 0 (0%) 0 (0%) 17h
default clara-resultsservice-5d98fdd785-krlb2 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28m
default clara-ui-6f89b97df8-knfqt 0 (0%) 0 (0%) 0 (0%) 0 (0%) 19h
default clara-workflow-controller-69cbb55fc8-9f77k 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3h56m
default elasticsearch-master-0 100m (0%) 1 (1%) 2Gi (0%) 2Gi (0%) 26m
default elasticsearch-master-1 100m (0%) 1 (1%) 2Gi (0%) 2Gi (0%) 26m
default fluentd-tzvxp 100m (0%) 0 (0%) 200Mi (0%) 512Mi (0%) 2m28s
kube-system coredns-bccdc95cf-2n695 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 19h
kube-system coredns-bccdc95cf-g8wkq 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 19h
kube-system etcd-user 0 (0%) 0 (0%) 0 (0%) 0 (0%) 53d
kube-system kube-apiserver-user 250m (0%) 0 (0%) 0 (0%) 0 (0%) 53d
kube-system kube-controller-manager-user 200m (0%) 0 (0%) 0 (0%) 0 (0%) 53d
kube-system kube-proxy-hcxkq 0 (0%) 0 (0%) 0 (0%) 0 (0%) 19h
kube-system kube-scheduler-user 100m (0%) 0 (0%) 0 (0%) 0 (0%) 53d
kube-system tiller-deploy-659c6788f5-tmzfz 0 (0%) 0 (0%) 0 (0%) 0 (0%) 23h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits


cpu 1050m (1%) 2 (3%)
memory 4436Mi (0%) 4948Mi (0%)
ephemeral-storage 0 (0%) 0 (0%)
Events:
Type Reason Age From Message


Warning EvictionThresholdMet 26m (x3908 over 12d) kubelet, user Attempting to reclaim ephemeral-storage
Normal NodeHasNoDiskPressure 20m (x1227 over 12d) kubelet, user Node user status is now: NodeHasNoDiskPressure
Warning ImageGCFailed 3m54s (x3084 over 12d) kubelet, user (combined from similar events): wanted to free 5906209177 bytes, but freed 0 bytes space with errors in image deletion: [rpc error: code = Unknown desc = Error response from daemon: conflict: unable to remove repository reference "nvidia/cuda:10.0-base" (must force) - container 282ac5ecd211 is using its referenced image 0f12aac8787e, rpc error: code = Unknown desc = Error response from daemon: conflict: unable to remove repository reference "nvidia/cuda:11.0-base" (must force) - container 2926cb8b2934 is using its referenced image 2ec708416bb8, rpc error: code = Unknown desc = Error response from daemon: conflict: unable to remove repository reference "hello-world:latest" (must force) - container 1d3602346f48 is using its referenced image bf756fb1ae65]
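The EvictionThresholdMet and ImageGCFailed events above suggest the node is under ephemeral-storage pressure, and the kubelet's image garbage collection is blocked because stopped containers (e.g. 282ac5ecd211, 2926cb8b2934) still reference the images it wants to delete. A possible cleanup, assuming the Docker runtime shown in the node info (these commands remove all stopped containers and unused images on the node, so check the lists first):

```shell
# Stopped containers pin their images, which prevents kubelet image GC
docker ps -a --filter "status=exited"

# Remove all stopped containers, then the now-unreferenced images
docker container prune -f
docker image prune -a -f

# Verify that ephemeral storage was actually reclaimed
df -h /var/lib/docker
```

Once enough disk space is freed and the node drops below the eviction threshold, newly scheduled pods should stop being evicted.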