Clara Deploy jobs randomly stuck in pending status

Hello,
I managed to install Clara Deploy on Ubuntu 20.04 and everything is running fine. To learn how to use Clara Deploy, I am investigating the reference pipelines (liver, spleen, prostate). I am able to send images, and jobs are created accordingly. Some jobs finish successfully and return nice segmentations/results, but others get stuck in “pending” status and never finish. This seems random. I tried to get more information by looking into the kubectl logs of various containers, but unfortunately I cannot find the appropriate error messages. So I am asking for any support or directions on how to further debug the problem of jobs staying in “pending” status.
Thank you very much,
Janis

Hi Janis,

Thanks for your interest in Clara Deploy, and welcome to the forums.

Is it a specific type of pipeline that hangs in pending? Some of the reference pipelines require more resources than others, and it’s possible the node does not meet those resource requirements.

Can you check the following?

clara describe job -j <job ID of pending job>

Also check the state of the node and kubernetes system pods:

kubectl describe node
kubectl get pods -A

If there are any pending or crashing pods associated with the pending jobs, check:

kubectl describe pod <pod name>
kubectl logs <pod name>
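To narrow things down, it can also help to list only the pods that are not in the Running phase and to look at recent cluster events, which usually surface scheduling problems (a sketch using standard kubectl options; nothing here is Clara-specific):

```shell
# show only pods whose phase is not Running (e.g. Pending or Failed)
kubectl get pods -A --field-selector=status.phase!=Running

# recent cluster events, oldest first; scheduling failures
# (insufficient CPU/memory/GPU, taints, etc.) show up here
kubectl get events -A --sort-by=.metadata.creationTimestamp
```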

Thanks,
Kris

Hi Janis,

If the pods for the pending jobs have been deleted, you can also pull logs using

clara logs -j <job ID> -o <operator name>

where you can find the operators of the job using

clara describe job -j <job ID>
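Putting the two together, you could loop over the operators of interest; a sketch, where the operator names are just examples taken from the prostate reference pipeline and `<job ID>` must be substituted:

```shell
JOB_ID="<job ID>"   # substitute the ID of the pending job

# list the job's operators first
clara describe job -j "$JOB_ID"

# then pull the logs of each operator in turn
for OP in dicom-reader prostate-segmentation dicom-seg-writer; do
  echo "=== $OP ==="
  clara logs -j "$JOB_ID" -o "$OP"
done
```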

Thanks,
Kris

Hi Kris,

Thanks for your quick help. Unfortunately, I could not identify any pending or crashing pods associated with the pending jobs. Do you have any other advice? Please find example output below.

Thanks,
Janis

clara describe job -j 24a4da8a38a64221986f9fc286bc3ab5
Job Description
Job_Name : prostate-prostate-20211011134354
Job_ID : 24a4da8a38a64221986f9fc286bc3ab5
Pipeline_ID : 78644ecff9854ad296ad778d4f2c1058
Payload_ID : e27d10a7dd9c49a6a7efcd76f977ee22
Job_Status : JOB_STATUS_HEALTHY
Job_State : JOB_STATE_PENDING
Job_Priority : JOB_PRIORITY_NORMAL
Messages :
Operators : [7]
NAME STATUS CREATED STARTED STOPPED
dicom-reader JOB_OPERATOR_STATUS_UNKNOWN - - -
dicom-seg-writer JOB_OPERATOR_STATUS_UNKNOWN - - -
prostate-segmentation JOB_OPERATOR_STATUS_UNKNOWN - - -
register-dicom-output JOB_OPERATOR_STATUS_UNKNOWN - - -
register-dicom-rtstruct-output JOB_OPERATOR_STATUS_UNKNOWN - - -
register-volume-images-for-rendering JOB_OPERATOR_STATUS_UNKNOWN - - -
rtstruct-writer JOB_OPERATOR_STATUS_UNKNOWN - - -

kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default clara-clara-platformapiserver-75bc9b4867-nsksz 1/1 Running 0 18h
default clara-console-6d944c97d5-vxjr5 2/2 Running 0 18h
default clara-dicom-adapter-7fb4cc587b-fjq9j 1/1 Running 0 18h
default clara-node-monitor-l4zsb 1/1 Running 0 18h
default clara-pipesvc-b40216de-trtis-6bbb4c4f6b-bjqvr 1/1 Running 0 18h
default clara-render-server-clara-renderer-6b9d4799f6-k6qjr 3/3 Running 0 18h
default clara-resultsservice-f758699b-8rggq 1/1 Running 0 18h
default clara-ui-758d9645b7-w2g72 1/1 Running 0 18h
default clara-workflow-controller-7c66d77f55-gcpnb 1/1 Running 0 18h
default fluentd-z7d9k 1/1 Running 0 18h
kube-system coredns-f9fd979d6-24wzv 1/1 Running 0 4d17h
kube-system coredns-f9fd979d6-bgwjb 1/1 Running 0 4d17h
kube-system etcd-deploy 1/1 Running 0 4d17h
kube-system kube-apiserver-deploy 1/1 Running 0 4d17h
kube-system kube-controller-manager-deploy 1/1 Running 0 4d17h
kube-system kube-flannel-ds-4nrpc 1/1 Running 0 4d17h
kube-system kube-proxy-trxln 1/1 Running 0 4d17h
kube-system kube-scheduler-deploy 1/1 Running 0 4d17h
kube-system nvidia-device-plugin-2l966 1/1 Running 0 4d17h

kubectl describe node
Name: deploy
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=deploy
kubernetes.io/os=linux
node-role.kubernetes.io/master=
Annotations: flannel.alpha.coreos.com/backend-data: {"VtepMAC":"6a:c0:53:0c:7d:c3"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 10.47.50.186
kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, 07 Oct 2021 16:22:06 +0200
Taints:
Unschedulable: false
Lease:
HolderIdentity: deploy
AcquireTime:
RenewTime: Tue, 12 Oct 2021 09:51:05 +0200
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message


NetworkUnavailable False Thu, 07 Oct 2021 16:22:34 +0200 Thu, 07 Oct 2021 16:22:34 +0200 FlannelIsUp Flannel is running on this node
MemoryPressure False Tue, 12 Oct 2021 09:50:28 +0200 Thu, 07 Oct 2021 16:22:04 +0200 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 12 Oct 2021 09:50:28 +0200 Thu, 07 Oct 2021 16:22:04 +0200 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 12 Oct 2021 09:50:28 +0200 Thu, 07 Oct 2021 16:22:04 +0200 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 12 Oct 2021 09:50:28 +0200 Thu, 07 Oct 2021 16:22:33 +0200 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.47.50.186
Hostname: deploy
Capacity:
cpu: 6
ephemeral-storage: 1921285344Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32845716Ki
nvidia.com/gpu: 1
pods: 110
Allocatable:
cpu: 6
ephemeral-storage: 1770656570099
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32743316Ki
nvidia.com/gpu: 1
pods: 110
System Info:
Machine ID: af2f2438f5f64c72995b65cb52be4a90
System UUID: 00000000-0000-0000-0000-309c23d0ae96
Boot ID: 1e98a66f-2003-4f3d-bb7b-cbfc1308c489
Kernel Version: 5.11.0-37-generic
OS Image: Ubuntu 20.04.3 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://20.10.7
Kubelet Version: v1.19.4
Kube-Proxy Version: v1.19.4
PodCIDR: 10.254.0.0/24
PodCIDRs: 10.254.0.0/24
Non-terminated Pods: (19 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE


default clara-clara-platformapiserver-75bc9b4867-nsksz 0 (0%) 0 (0%) 0 (0%) 0 (0%) 18h
default clara-console-6d944c97d5-vxjr5 0 (0%) 0 (0%) 0 (0%) 0 (0%) 18h
default clara-dicom-adapter-7fb4cc587b-fjq9j 0 (0%) 0 (0%) 0 (0%) 0 (0%) 18h
default clara-node-monitor-l4zsb 0 (0%) 0 (0%) 0 (0%) 0 (0%) 18h
default clara-pipesvc-b40216de-trtis-6bbb4c4f6b-bjqvr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 18h
default clara-render-server-clara-renderer-6b9d4799f6-k6qjr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 18h
default clara-resultsservice-f758699b-8rggq 0 (0%) 0 (0%) 0 (0%) 0 (0%) 18h
default clara-ui-758d9645b7-w2g72 0 (0%) 0 (0%) 0 (0%) 0 (0%) 18h
default clara-workflow-controller-7c66d77f55-gcpnb 0 (0%) 0 (0%) 0 (0%) 0 (0%) 18h
default fluentd-z7d9k 100m (1%) 0 (0%) 200Mi (0%) 512Mi (1%) 18h
kube-system coredns-f9fd979d6-24wzv 100m (1%) 0 (0%) 70Mi (0%) 170Mi (0%) 4d17h
kube-system coredns-f9fd979d6-bgwjb 100m (1%) 0 (0%) 70Mi (0%) 170Mi (0%) 4d17h
kube-system etcd-deploy 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4d17h
kube-system kube-apiserver-deploy 250m (4%) 0 (0%) 0 (0%) 0 (0%) 4d17h
kube-system kube-controller-manager-deploy 200m (3%) 0 (0%) 0 (0%) 0 (0%) 4d17h
kube-system kube-flannel-ds-4nrpc 100m (1%) 100m (1%) 50Mi (0%) 50Mi (0%) 4d17h
kube-system kube-proxy-trxln 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4d17h
kube-system kube-scheduler-deploy 100m (1%) 0 (0%) 0 (0%) 0 (0%) 4d17h
kube-system nvidia-device-plugin-2l966 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4d17h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits


cpu 950m (15%) 100m (1%)
memory 390Mi (1%) 902Mi (2%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 0 0
Events:

clara logs -j 24a4da8a38a64221986f9fc286bc3ab5 -o dicom-reader
Error: Code: -8459, Unable to fetch operator logs for job 24a4da8a38a64221986f9fc286bc3ab5 as job has not started yet.;Unable to fetch logs of job {24a4da8a38a64221986f9fc286bc3ab5} since job state is Pending

clara logs -j 24a4da8a38a64221986f9fc286bc3ab5 -o prostate-segmentation
Error: Code: -8459, Unable to fetch operator logs for job 24a4da8a38a64221986f9fc286bc3ab5 as job has not started yet.;Unable to fetch logs of job {24a4da8a38a64221986f9fc286bc3ab5} since job state is Pending

Hi Kris,

Unfortunately, the logs you recommended did not let me narrow down the issue further. Do you have any other hints on how or where I could find the corresponding error messages?

Thanks,
Janis