Clara: Insufficient resources

Good Morning,

we are trying to run a pipeline using local COVID input files, but when launching

clara create jobs -n covid-test-1 -p <pipeline-id> -f ./input/dcm/

we get

"Unable to create job. Error:Code: -8454, Insufficient resources, can not launch job."

We are on AWS with a Tesla T4 GPU, a 4C/4T CPU @ 2.50GHz, 15 GB RAM, and 19 GB of available disk (63% used).

Any idea what might be causing this? Thank you in advance.

Best Regards

Have you looked at the known issue in Section 20.1.3.2, "Clara Deploy SDK from NGC Only Uses a Single GPU," of the NVIDIA Clara Deploy SDK User Guide? If not, please configure availableGpus to at least 2.

The COVID-19 reference pipeline consists of two AI inference operators (containers), and the platform tries to create two instances of the Triton Inference Server (formerly TRTIS), each with a GPU. If the step above is not done, the Clara Deploy platform cannot start the Triton instances, because only one GPU is available to the pipeline.
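
As a quick sanity check (a hedged sketch: pipesvc is the pipeline-service naming used later in this thread, and the pod name is illustrative), once the job launches you should see two Triton pipeline-service pods, each requesting one GPU:

kubectl get pods | grep pipesvc
# inspect the GPU request of one of the pipeline-service pods
kubectl describe pod <pipesvc-pod-name> | grep nvidia.com/gpu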

This is a known issue, and it requires (re)configuration of availableGpus regardless of the number of physical GPUs present.

So, please change availableGpus to a value greater than 1 (even if only one physical GPU is present), restart the platform, and try creating and running the pipeline again.
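
For reference, a minimal sketch of that change, assuming availableGpus is a plain YAML entry in the ~/.clara/charts/clara/values.yaml file referenced later in this thread:

# assumes values.yaml contains a line like "availableGpus: 1"
sed -i 's/availableGpus:.*/availableGpus: 2/' ~/.clara/charts/clara/values.yaml
# restart the platform so the new value takes effect
clara platform stop
clara platform start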

Apologies for not clearly linking the known issue to the COVID-19 pipeline; I hope this helps.

Good morning,

we changed availableGpus to 2 or a greater value (3, 4, …) and restarted the platform every time: always the same error. Thanks in advance for your response.

Best regards

Good morning,

any ideas about this issue? Thanks in advance for your response.

Best Regards

Good morning.

The engineers who worked on the specific API have been asked to help look into the issue.

Best regards

Good morning,

we tested other configurations on AWS (4 Tesla T4 GPUs, 48 vCPUs, 186 GB memory, 94 GB available disk): all the pipelines work except the COVID one. Always the same error. Thanks in advance for your response.

Best Regards

It is good that all the other pipelines work. May I circle back and ask whether you have checked availableGpus according to the release note mentioned above? If not, please reconfigure it and restart the platform (unfortunately, this will remove any already-published pipelines).

Please also be advised that a new version of the COVID-19 pipeline was published on NGC on June 10th, with improvements that fix an issue with processing positive-case inputs.

Good morning,

yes, we changed availableGpus to 4 and tested the new COVID-19 pipeline too, but we always get the same error ("Unable to create job. Error:Code: -8454, Insufficient resources, can not launch job.").

The instance type we're using is 'g4dn.12xlarge': could you check with this one? Thanks in advance for your response.

Best Regards

Hello @NDev

Could you please run the following commands to check whether resources are already allocated to other pods/services?

kubectl describe nodes  # to see capacity/available resources like below
    Capacity:
     cpu:                12
     ephemeral-storage:  959200352Ki
     hugepages-1Gi:      0
     hugepages-2Mi:      0
     memory:             65639056Ki
     nvidia.com/gpu:     1
     pods:               110
    Allocatable:
     cpu:                12
     ephemeral-storage:  883999042940
     hugepages-1Gi:      0
     hugepages-2Mi:      0
     memory:             65536656Ki
     nvidia.com/gpu:     1
     pods:               110
kubectl get all | grep pipesvc
# Shows any long-running pipeline services: launching a pipeline may start a
# Triton Inference Server (TRTIS) pipeline service, which can keep holding GPU resources.
# If pipeline services already exist before launching the COVID-19 pipeline,
# delete them with the following command:
kubectl get all | grep pipesvc | egrep 'deployment|service' | awk '{print $1}' | xargs kubectl delete
# Also, please check the GPU status:
nvidia-smi

Thanks.

Good Morning,

the requested checks are shown below:

kubectl describe nodes
Name: ip-172-31-47-129
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-172-31-47-129
kubernetes.io/os=linux
node-role.kubernetes.io/master=
Annotations: flannel.alpha.coreos.com/backend-data: {"VtepMAC":"2a:17:9c:d2:d9:8b"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 172.31.47.129
kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, 18 Jun 2020 10:35:54 +0000
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Fri, 19 Jun 2020 08:16:21 +0000 Thu, 18 Jun 2020 10:35:53 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 19 Jun 2020 08:16:21 +0000 Thu, 18 Jun 2020 10:35:53 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 19 Jun 2020 08:16:21 +0000 Thu, 18 Jun 2020 10:35:53 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 19 Jun 2020 08:16:21 +0000 Thu, 18 Jun 2020 10:36:18 +0000 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 172.31.47.129
Hostname: ip-172-31-47-129
Capacity:
cpu: 48
ephemeral-storage: 32461564Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 195889568Ki
nvidia.com/gpu: 4
pods: 110
Allocatable:
cpu: 48
ephemeral-storage: 29916577333
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 195787168Ki
nvidia.com/gpu: 4
pods: 110
System Info:
Machine ID: 5ac799bfebaf40068ba766a37cbf633b
System UUID: EC2B30B5-DB1C-8D23-69B5-9E77F0345677
Boot ID: 821c9e90-51e5-47ac-93c9-1f4110e9a19c
Kernel Version: 4.15.0-1063-aws
OS Image: Ubuntu 18.04.4 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://19.3.6
Kubelet Version: v1.15.4
Kube-Proxy Version: v1.15.4
PodCIDR: 10.244.0.0/24
Non-terminated Pods: (23 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
default clara-clara-platformapiserver-7dc6c6699f-nw46t 0 (0%) 0 (0%) 0 (0%) 0 (0%) 21h
default clara-console-mongodb-85f8bd5f95-p59g5 0 (0%) 0 (0%) 0 (0%) 0 (0%) 21h
default clara-dicom-adapter-77677c7788-fkntt 0 (0%) 0 (0%) 0 (0%) 0 (0%) 21h
default clara-monitor-server-fluentd-elasticsearch-nmd78 0 (0%) 0 (0%) 0 (0%) 0 (0%) 21h
default clara-monitor-server-grafana-5f874b974d-kgzm6 0 (0%) 0 (0%) 0 (0%) 0 (0%) 21h
default clara-monitor-server-monitor-server-868c5fcf89-kzccv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 21h
default clara-render-server-clara-renderer-789fcb6cb6-52d9w 0 (0%) 0 (0%) 0 (0%) 0 (0%) 21h
default clara-resultsservice-bcd9ff49d-gx8bt 0 (0%) 0 (0%) 0 (0%) 0 (0%) 21h
default clara-ui-6f89b97df8-6lj4k 0 (0%) 0 (0%) 0 (0%) 0 (0%) 21h
default clara-ux-77c9b96ccb-qgjps 0 (0%) 0 (0%) 0 (0%) 0 (0%) 21h
default clara-workflow-controller-69cbb55fc8-f962s 0 (0%) 0 (0%) 0 (0%) 0 (0%) 21h
default elasticsearch-master-0 100m (0%) 1 (2%) 2Gi (1%) 2Gi (1%) 21h
default elasticsearch-master-1 100m (0%) 1 (2%) 2Gi (1%) 2Gi (1%) 21h
kube-system coredns-5c98db65d4-78sgf 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 21h
kube-system coredns-5c98db65d4-wdhhn 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 21h
kube-system etcd-ip-172-31-47-129 0 (0%) 0 (0%) 0 (0%) 0 (0%) 21h
kube-system kube-apiserver-ip-172-31-47-129 250m (0%) 0 (0%) 0 (0%) 0 (0%) 21h
kube-system kube-controller-manager-ip-172-31-47-129 200m (0%) 0 (0%) 0 (0%) 0 (0%) 21h
kube-system kube-flannel-ds-amd64-qxdk6 100m (0%) 100m (0%) 50Mi (0%) 50Mi (0%) 21h
kube-system kube-proxy-ln9qc 0 (0%) 0 (0%) 0 (0%) 0 (0%) 21h
kube-system kube-scheduler-ip-172-31-47-129 100m (0%) 0 (0%) 0 (0%) 0 (0%) 21h
kube-system nvidia-device-plugin-daemonset-9mj4f 0 (0%) 0 (0%) 0 (0%) 0 (0%) 21h
kube-system tiller-deploy-7bf78cdbf7-f4g9s 0 (0%) 0 (0%) 0 (0%) 0 (0%) 21h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1050m (2%) 2100m (4%)
memory 4286Mi (2%) 4486Mi (2%)
ephemeral-storage 0 (0%) 0 (0%)
nvidia.com/gpu 0 0
Events:
Type Reason Age From Message
---- ------ --- ---- -------
Normal NodeHasSufficientMemory 21h (x7 over 21h) kubelet, ip-172-31-47-129 Node ip-172-31-47-129 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 21h (x7 over 21h) kubelet, ip-172-31-47-129 Node ip-172-31-47-129 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 21h (x6 over 21h) kubelet, ip-172-31-47-129 Node ip-172-31-47-129 status is now: NodeHasSufficientPID
Normal Starting 21h kubelet, ip-172-31-47-129 Starting kubelet.
Normal NodeHasSufficientMemory 21h kubelet, ip-172-31-47-129 Node ip-172-31-47-129 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 21h kubelet, ip-172-31-47-129 Node ip-172-31-47-129 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 21h kubelet, ip-172-31-47-129 Node ip-172-31-47-129 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 21h kubelet, ip-172-31-47-129 Updated Node Allocatable limit across pods
Normal Starting 21h kube-proxy, ip-172-31-47-129 Starting kube-proxy.
Normal NodeReady 21h kubelet, ip-172-31-47-129 Node ip-172-31-47-129 status is now: NodeReady
Normal NodeAllocatableEnforced 21h kubelet, ip-172-31-47-129 Updated Node Allocatable limit across pods
Normal Starting 21h kubelet, ip-172-31-47-129 Starting kubelet.
Normal NodeHasNoDiskPressure 21h (x8 over 21h) kubelet, ip-172-31-47-129 Node ip-172-31-47-129 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 21h (x8 over 21h) kubelet, ip-172-31-47-129 Node ip-172-31-47-129 status is now: NodeHasSufficientPID
Normal NodeHasSufficientMemory 21h (x7 over 21h) kubelet, ip-172-31-47-129 Node ip-172-31-47-129 status is now: NodeHasSufficientMemory
Normal Starting 21h kube-proxy, ip-172-31-47-129 Starting kube-proxy.
Normal Starting 2m3s kubelet, ip-172-31-47-129 Starting kubelet.
Normal NodeHasSufficientMemory 2m3s (x8 over 2m3s) kubelet, ip-172-31-47-129 Node ip-172-31-47-129 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 2m3s (x8 over 2m3s) kubelet, ip-172-31-47-129 Node ip-172-31-47-129 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 2m3s (x7 over 2m3s) kubelet, ip-172-31-47-129 Node ip-172-31-47-129 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 2m3s kubelet, ip-172-31-47-129 Updated Node Allocatable limit across pods
Normal Starting 116s kube-proxy, ip-172-31-47-129 Starting kube-proxy.

kubectl get all | grep pipesvc
(no output)

nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   48C    P0    28W /  70W |    761MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |     11MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |     11MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |     11MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      7545      C   /app/NvRTVolOptixRenderServer                749MiB |
+-----------------------------------------------------------------------------+

Thanks in advance for your response.

Best Regards

Hi @NDev,

Could you please open the ~/.clara/charts/clara/values.yaml file and set availableGpus to -1 to disable the GPU checks?
After that, you will need to redeploy Clara with the following commands, republish the pipeline, and perform the pipeline execution steps again.

# stop Clara
clara console stop
clara monitor stop
clara render stop
clara dicom stop
clara platform stop

# start Clara
clara platform start
clara dicom start
clara render start
clara monitor start
clara console start

Good Morning,

after setting availableGpus to -1, it works. So, just to understand: what was the problem?

Furthermore, the execution seems a little slow: for example, it takes about 2 minutes for 400 dcm images. Is that a typical time?

Thanks in advance for your response.

Best Regards

g4dn.12xlarge has 4 GPUs, so there is no reason that setting availableGpus=4 should not work, but I'm glad it is working for you. BTW, availableGpus is now set to -1 by default in the latest assets on NGC (nvidia/clara).

To figure out the time spent in each component of the pipeline: Clara Console should display that information, though it is not fully integrated yet. In the meantime, you need to use Kubernetes commands to get the container logs, which show the elapsed execution time of each operator. In my testing, a CT chest scan with over 500 slices took around 20 seconds for segmentation and 5 seconds for classification.
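
For example, a hedged sketch (pod and container names are illustrative and will differ on your system):

# list the pods created for the pipeline job
kubectl get pods
# print one operator container's logs with per-line timestamps,
# from which the operator's elapsed time can be read
kubectl logs --timestamps <job-pod-name> -c <operator-container-name>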

Regards

Good Morning,

what we are noticing is strange behavior. Let me explain:

  • first run with 400 dcm covid images of patient 1: about 27 seconds total;
  • wait for the previous job to finish and second run with 400 dcm covid images of patient 2: about 40 seconds total;
  • wait for the previous job to finish and third run with 400 dcm covid images of patient 3: about 2 minutes;
  • wait for the previous job to finish and fourth run with 400 dcm covid images of patient 1 again: about 5 minutes;
  • wait for the previous job to finish and fifth run with 400 dcm covid images of patient 2 again: "Error: Make sure clara platform is up and running".

At this point, whatever command we issue, we receive: "Error: Make sure clara platform is up and running".

So, we reboot, and:

  • run with 400 dcm covid images of patient 1: after about 5 minutes, again "Error: Make sure clara platform is up and running".

So, we delete the folders in "/clara/payload", reboot, and:

  • run with 400 dcm covid images of patient 1: about 27 seconds total again.

Any idea what is happening? Are we doing something wrong? Thanks in advance for your response.

Best Regards

Hi NDev,

Thanks for bringing up the issues.

First, congratulations on having succeeded in running the COVID-19 pipeline with very good response times.

Our QA has not observed this doubling of latency when testing the pipelines on AWS, so it is not possible to pinpoint the cause of the problem without more details.

Was Clara Deploy installed on the system partition, and what is the partition's size?

Also, we would be able to assist if you can grab the platform version and container tags, as well as the logs of the platform pods and pipeline containers, specifically (see the commands sketched after the list):

  • Clara Deploy version and container tags
  • Logs of the DICOM Adapter pod
  • Logs of the Platform Server pod
  • Logs of the containers in the COVID-19 pipeline job pod
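
A hedged sketch for collecting those logs (the deployment names come from your kubectl describe nodes output above; the job pod name is illustrative):

# DICOM Adapter and Platform API Server logs
kubectl logs deploy/clara-dicom-adapter > dicom.log
kubectl logs deploy/clara-clara-platformapiserver > platform.log
# logs of all containers in the pipeline job pod
kubectl logs <covid-job-pod-name> --all-containers > job.log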

Best regards

dicom.log (3.3 KB)
platform.log (1.0 MB)

Good Morning,

here are some checks:

clara version:
Version: 0.5.0-10383

df -h:
Filesystem Size Used Avail Use% Mounted on
udev 94G 0 94G 0% /dev
tmpfs 19G 2.6M 19G 1% /run
/dev/nvme1n1p1 33G 27G 6.4G 81% /
tmpfs 94G 0 94G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 94G 0 94G 0% /sys/fs/cgroup
/dev/loop0 97M 97M 0 100% /snap/core/9436
/dev/loop1 29M 29M 0 100% /snap/amazon-ssm-agent/2012
/dev/loop2 98M 98M 0 100% /snap/core/9289
/dev/loop3 18M 18M 0 100% /snap/amazon-ssm-agent/1566
tmpfs 19G 0 19G 0% /run/user/1000

docker image ls:
REPOSITORY TAG IMAGE ID CREATED SIZE
nvcr.io/nvidia/clara/ai-covid-19 0.6.0-2006.3 7d04471a1b11 3 weeks ago 1.82GB
nvcr.io/nvidia/clara/ai-lung 0.6.0-2006.3 11d4513f608b 3 weeks ago 2.53GB
busybox latest 1c35c4412082 3 weeks ago 1.22MB
bitnami/minideb stretch 5b33a3f54fe6 6 weeks ago 53.7MB
k8s.gcr.io/kube-apiserver v1.15.12 c81971987f04 7 weeks ago 207MB
k8s.gcr.io/kube-controller-manager v1.15.12 7b4d4985877a 7 weeks ago 159MB
k8s.gcr.io/kube-proxy v1.15.12 00206e1127f2 7 weeks ago 82.5MB
k8s.gcr.io/kube-scheduler v1.15.12 196d53938faa 7 weeks ago 81.2MB
nvcr.io/nvidia/clara/monitor-server 0.5.0-2004.7 60888b39a985 2 months ago 577MB
nvcr.io/nvidia/clara/register-results 0.5.0-2004.7 a73d170a22a4 2 months ago 191MB
nvcr.io/nvidia/clara/dicom-writer 0.5.0-2004.7 7f3bef6b28c0 2 months ago 518MB
nvcr.io/nvidia/clara/dicom-reader 0.5.0-2004.7 9d901610a57b 2 months ago 518MB
nvcr.io/nvidia/clara/renderserver_ng 0.5.0-2004.7 ba3362e81f19 2 months ago 161MB
nvcr.io/nvidia/clara/resultsservice 0.5.0-2004.7 0b670c106793 2 months ago 284MB
nvcr.io/nvidia/clara/dicomadapter 0.5.0-2004.7 563aedef450e 2 months ago 254MB
nvcr.io/nvidia/clara/claraux-backend 0.5.0-2004.7 49021de23bb8 2 months ago 248MB
nvcr.io/nvidia/clara/claraux-frontend 0.5.0-2004.7 b7b5107acc30 2 months ago 102MB
nvcr.io/nvidia/clara/platformapiserver 0.5.0-2004.7 666aab6ac8b8 2 months ago 203MB
nvcr.io/nvidia/clara/clara-dashboard 0.5.0-2004.7 2cbaad25f33c 2 months ago 405MB
nvcr.io/nvidia/clara/podmanager 0.5.0-2004.7 fb10d88594b6 3 months ago 336MB
nvcr.io/nvidia/clara/clara-datasetservice 0.5.0-2004.7 3b95941cdbd2 3 months ago 1.06GB
bitnami/mongodb 4.0.13-debian-9-r22 2173506821fd 7 months ago 381MB
grafana/grafana 6.3.3 a6e14b4109af 10 months ago 253MB
nvcr.io/nvidia/tensorrtserver 19.08-py3 951d0400284e 10 months ago 7.73GB
quay.io/fluentd_elasticsearch/fluentd v2.7.0 ca9e624ea9d7 11 months ago 140MB
docker.elastic.co/elasticsearch/elasticsearch 7.2.0 0efa6a3de177 12 months ago 861MB
gcr.io/kubernetes-helm/tiller v2.14.1 ac22eb1f780e 12 months ago 94.2MB
nvidia/k8s-device-plugin 1.0.0-beta 7354b8a31679 13 months ago 63.1MB
quay.io/coreos/flannel v0.11.0-amd64 ff281650a721 17 months ago 52.6MB
k8s.gcr.io/coredns 1.3.1 eb516548c180 17 months ago 40.3MB
k8s.gcr.io/etcd 3.3.10 2c4adeb21b4f 19 months ago 258MB
argoproj/workflow-controller v2.2.1 abcb0c0ba87c 20 months ago 140MB
argoproj/argoui v2.2.1 b4984f9f768b 20 months ago 189MB
k8s.gcr.io/pause 3.1 da86e6ba6ca1 2 years ago 742kB

docker container ls:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
9c64724dfde3 0efa6a3de177 "/usr/local/bin/dock…" 6 minutes ago Up 6 minutes k8s_elasticsearch_elasticsearch-master-0_default_1acd5956-ab91-4de8-b4e7-74771550b7d7_1
927e01deeb63 0efa6a3de177 "/usr/local/bin/dock…" 6 minutes ago Up 6 minutes k8s_elasticsearch_elasticsearch-master-1_default_bc1ebbe0-4ed9-48e7-938f-234a21674fdf_1
b8c829ca29cf 3b95941cdbd2 "/app/bin/dataset_se…" 6 minutes ago Up 6 minutes k8s_dataset-service-clara-renderer_clara-render-server-clara-renderer-789fcb6cb6-nfd78_default_0558f84c-7b6d-4636-96a1-3727854ba8ed_1
b4d44592f992 2173506821fd "/entrypoint.sh /run…" 6 minutes ago Up 6 minutes k8s_clara-console-mongodb_clara-console-mongodb-85f8bd5f95-5k948_default_ea634943-92fd-4d05-b8e2-a86e3c7e939f_10
da49ef2a7d00 ba3362e81f19 "/tmp/run.sh" 6 minutes ago Up 6 minutes k8s_renderer-clara-renderer_clara-render-server-clara-renderer-789fcb6cb6-nfd78_default_0558f84c-7b6d-4636-96a1-3727854ba8ed_1
55b8d8e79a6c 2cbaad25f33c "docker-entrypoint.s…" 6 minutes ago Up 6 minutes k8s_ui-clara-renderer_clara-render-server-clara-renderer-789fcb6cb6-nfd78_default_0558f84c-7b6d-4636-96a1-3727854ba8ed_1
2ba0e3fca7bc k8s.gcr.io/pause:3.1 "/pause" 6 minutes ago Up 6 minutes k8s_POD_clara-render-server-clara-renderer-789fcb6cb6-nfd78_default_0558f84c-7b6d-4636-96a1-3727854ba8ed_3
342f3022295e ac22eb1f780e "/tiller" 6 minutes ago Up 6 minutes k8s_tiller_tiller-deploy-7bf78cdbf7-f4g9s_kube-system_6de4c884-b88e-4052-94fe-2fd1e04de0d7_13
97a6bfeb7775 951d0400284e "trtserver --model-s…" 6 minutes ago Up 6 minutes k8s_fd7219e9-88f2-435b-b953-3379b420b307_fd7219e9-trtis-clara-pipesvc-5447f7b549-rpgj7_default_b37de8cd-7ef8-4545-9beb-ae91c8be2758_10
460a6eb7ba15 0b670c106793 "dotnet /opt/nvidia/…" 6 minutes ago Up 6 minutes k8s_clara_clara-resultsservice-bcd9ff49d-h9vcb_default_75966131-ee15-46c0-8c93-97ff96ac9139_1
33077e720a34 eb516548c180 "/coredns -conf /etc…" 6 minutes ago Up 6 minutes k8s_coredns_coredns-5c98db65d4-78sgf_kube-system_68c0ffce-bf60-47f6-a4fc-be28f4a65608_13
50c0f26cb5b0 eb516548c180 "/coredns -conf /etc…" 6 minutes ago Up 6 minutes k8s_coredns_coredns-5c98db65d4-wdhhn_kube-system_0af1853e-fd92-4b6c-8557-21c068bb4fe1_13
0e68f7a85668 b4984f9f768b "/bin/sh -c 'node ap…" 6 minutes ago Up 6 minutes k8s_ui_clara-ui-6f89b97df8-d7dgw_default_bf99ee0b-def9-4d91-a94f-d8d373249d9b_10
268856b865bd 60888b39a985 "python3 server.py -…" 6 minutes ago Up 6 minutes k8s_monitor-server_clara-monitor-server-monitor-server-868c5fcf89-l4bhl_default_e5b2f497-803c-4f1f-81ae-60644e47a772_10
3ff5a9fb7517 563aedef450e "dotnet /opt/nvidia/…" 6 minutes ago Up 6 minutes k8s_dicom-adapter_clara-dicom-adapter-77677c7788-fpjm2_default_25cd19b2-7e67-41ca-80e9-d62d13e514d4_7
62a689c00a45 k8s.gcr.io/pause:3.1 "/pause" 6 minutes ago Up 6 minutes k8s_POD_tiller-deploy-7bf78cdbf7-f4g9s_kube-system_6de4c884-b88e-4052-94fe-2fd1e04de0d7_67
ec039839cd6a 7354b8a31679 "nvidia-device-plugin" 6 minutes ago Up 6 minutes k8s_nvidia-device-plugin-ctr_nvidia-device-plugin-daemonset-9mj4f_kube-system_c708fb22-3a7d-4d4d-b83a-e7a8a399985f_13
a1e1c3111578 ca9e624ea9d7 "/run.sh" 6 minutes ago Up 6 minutes k8s_clara-monitor-server-fluentd-elasticsearch_clara-monitor-server-fluentd-elasticsearch-xmwpp_default_5b7edf18-8a43-470f-9dec-338e921732ad_10
7a32d27b6a5b k8s.gcr.io/pause:3.1 "/pause" 6 minutes ago Up 6 minutes k8s_POD_fd7219e9-trtis-clara-pipesvc-5447f7b549-rpgj7_default_b37de8cd-7ef8-4545-9beb-ae91c8be2758_39
29946da534db 49021de23bb8 "npm run start" 6 minutes ago Up 6 minutes k8s_backend_clara-ux-77c9b96ccb-mwbtd_default_7346f3ee-3d14-4c35-9a23-3e01f4fd2339_11
6c8abf2807da k8s.gcr.io/pause:3.1 "/pause" 6 minutes ago Up 6 minutes k8s_POD_clara-resultsservice-bcd9ff49d-h9vcb_default_75966131-ee15-46c0-8c93-97ff96ac9139_1
368e328a9302 b7b5107acc30 "nginx -g 'daemon of…" 6 minutes ago Up 6 minutes k8s_frontend_clara-ux-77c9b96ccb-mwbtd_default_7346f3ee-3d14-4c35-9a23-3e01f4fd2339_10
8d2d60de7ab7 abcb0c0ba87c "workflow-controller…" 6 minutes ago Up 6 minutes k8s_controller_clara-workflow-controller-69cbb55fc8-hs2nr_default_949b9352-6e4d-484b-badd-f1822913300d_10
f2cff3126bc4 k8s.gcr.io/pause:3.1 "/pause" 6 minutes ago Up 6 minutes k8s_POD_clara-console-mongodb-85f8bd5f95-5k948_default_ea634943-92fd-4d05-b8e2-a86e3c7e939f_43
aa4cf2a7a465 k8s.gcr.io/pause:3.1 "/pause" 6 minutes ago Up 6 minutes k8s_POD_coredns-5c98db65d4-78sgf_kube-system_68c0ffce-bf60-47f6-a4fc-be28f4a65608_61
5dde9c26b62b 666aab6ac8b8 "dotnet Nvidia.Clara…" 6 minutes ago Up 6 minutes k8s_platformapiserver_clara-clara-platformapiserver-7944594fc6-4rvk7_default_f382d961-2492-4a52-8727-6dda7a6a3be6_1
d4c2d13e40bd k8s.gcr.io/pause:3.1 "/pause" 6 minutes ago Up 6 minutes k8s_POD_coredns-5c98db65d4-wdhhn_kube-system_0af1853e-fd92-4b6c-8557-21c068bb4fe1_58
c13c8377f9f4 k8s.gcr.io/pause:3.1 "/pause" 6 minutes ago Up 6 minutes k8s_POD_elasticsearch-master-0_default_1acd5956-ab91-4de8-b4e7-74771550b7d7_1
e7e93b40e5ab a6e14b4109af "/run.sh" 6 minutes ago Up 6 minutes k8s_grafana_clara-monitor-server-grafana-5f874b974d-5knrq_default_42c7c67f-58b8-4a47-a70b-e9e9efea24cc_10
9311eb3f9f03 k8s.gcr.io/pause:3.1 "/pause" 6 minutes ago Up 6 minutes k8s_POD_clara-ui-6f89b97df8-d7dgw_default_bf99ee0b-def9-4d91-a94f-d8d373249d9b_54
1265ff6aded8 k8s.gcr.io/pause:3.1 "/pause" 6 minutes ago Up 6 minutes k8s_POD_clara-monitor-server-monitor-server-868c5fcf89-l4bhl_default_e5b2f497-803c-4f1f-81ae-60644e47a772_43
542103df0953 k8s.gcr.io/pause:3.1 "/pause" 6 minutes ago Up 6 minutes k8s_POD_clara-dicom-adapter-77677c7788-fpjm2_default_25cd19b2-7e67-41ca-80e9-d62d13e514d4_23
590191de1487 k8s.gcr.io/pause:3.1 "/pause" 7 minutes ago Up 6 minutes k8s_POD_clara-monitor-server-fluentd-elasticsearch-xmwpp_default_5b7edf18-8a43-470f-9dec-338e921732ad_37
505e1e8b2228 k8s.gcr.io/pause:3.1 "/pause" 7 minutes ago Up 6 minutes k8s_POD_nvidia-device-plugin-daemonset-9mj4f_kube-system_c708fb22-3a7d-4d4d-b83a-e7a8a399985f_61
e71b34e4b65d k8s.gcr.io/pause:3.1 "/pause" 7 minutes ago Up 6 minutes k8s_POD_elasticsearch-master-1_default_bc1ebbe0-4ed9-48e7-938f-234a21674fdf_1
999ab801244f k8s.gcr.io/pause:3.1 "/pause" 7 minutes ago Up 6 minutes k8s_POD_clara-ux-77c9b96ccb-mwbtd_default_7346f3ee-3d14-4c35-9a23-3e01f4fd2339_45
92497de41bd4 k8s.gcr.io/pause:3.1 "/pause" 7 minutes ago Up 6 minutes k8s_POD_clara-workflow-controller-69cbb55fc8-hs2nr_default_949b9352-6e4d-484b-badd-f1822913300d_43
a72ca6e5f73f k8s.gcr.io/pause:3.1 "/pause" 7 minutes ago Up 6 minutes k8s_POD_clara-clara-platformapiserver-7944594fc6-4rvk7_default_f382d961-2492-4a52-8727-6dda7a6a3be6_1
e525caea84c9 k8s.gcr.io/pause:3.1 "/pause" 7 minutes ago Up 6 minutes k8s_POD_clara-monitor-server-grafana-5f874b974d-5knrq_default_42c7c67f-58b8-4a47-a70b-e9e9efea24cc_40
fdd73c63302b ff281650a721 "/opt/bin/flanneld -…" 7 minutes ago Up 7 minutes k8s_kube-flannel_kube-flannel-ds-amd64-qxdk6_kube-system_43d8602d-d4f4-4f80-bdd5-7d62ef4d2b29_17
e76b3eb1e4e5 k8s.gcr.io/pause:3.1 "/pause" 7 minutes ago Up 7 minutes k8s_POD_kube-flannel-ds-amd64-qxdk6_kube-system_43d8602d-d4f4-4f80-bdd5-7d62ef4d2b29_13
afc880b474e7 00206e1127f2 "/usr/local/bin/kube…" 7 minutes ago Up 7 minutes k8s_kube-proxy_kube-proxy-ln9qc_kube-system_08253531-1be0-4967-9733-b967b3cf9238_13
c88bf13cf4c7 k8s.gcr.io/pause:3.1 "/pause" 7 minutes ago Up 7 minutes k8s_POD_kube-proxy-ln9qc_kube-system_08253531-1be0-4967-9733-b967b3cf9238_13
c1105a4883cc 2c4adeb21b4f "etcd --advertise-cl…" 7 minutes ago Up 7 minutes k8s_etcd_etcd-ip-172-31-47-129_kube-system_1b7d13868bbff1a590a3c56b7cc2ad75_13
ab936260fb0f c81971987f04 "kube-apiserver --ad…" 7 minutes ago Up 7 minutes k8s_kube-apiserver_kube-apiserver-ip-172-31-47-129_kube-system_ffdae0d3851e2738e9733efa2fc6847a_13
0f85bcd57d91 7b4d4985877a "kube-controller-man…" 7 minutes ago Up 7 minutes k8s_kube-controller-manager_kube-controller-manager-ip-172-31-47-129_kube-system_d47ebfdc6576dacab9d52624b30c259f_13
b86ac8b11133 196d53938faa "kube-scheduler --bi…" 7 minutes ago Up 7 minutes k8s_kube-scheduler_kube-scheduler-ip-172-31-47-129_kube-system_37bbbfb82a966a388adac318f32b758f_13
c71b4cba5c24 k8s.gcr.io/pause:3.1 "/pause" 7 minutes ago Up 7 minutes k8s_POD_etcd-ip-172-31-47-129_kube-system_1b7d13868bbff1a590a3c56b7cc2ad75_13
2b1776ef931f k8s.gcr.io/pause:3.1 "/pause" 7 minutes ago Up 7 minutes k8s_POD_kube-scheduler-ip-172-31-47-129_kube-system_37bbbfb82a966a388adac318f32b758f_13
018aa934ca34 k8s.gcr.io/pause:3.1 "/pause" 7 minutes ago Up 7 minutes k8s_POD_kube-controller-manager-ip-172-31-47-129_kube-system_d47ebfdc6576dacab9d52624b30c259f_13
308bb59ed4ad k8s.gcr.io/pause:3.1 "/pause" 7 minutes ago Up 7 minutes k8s_POD_kube-apiserver-ip-172-31-47-129_kube-system_ffdae0d3851e2738e9733efa2fc6847a_13


Thank you very much for providing the details.

The version of the Platform on your system is the older v0.5. A new version, Clara Deploy v0.6, was published on NGC (org/team nvidia/clara) on June 23rd. Please download, install, and run this new version.

In the meantime, I have looked into the platform log and found the following exception occurring numerous times, consistently throughout:
at Nvidia.Clara.Platform.Services.K8sJobsRepository.GetJob(Guid jobId, Job& job) in /src/Api/Server/Repositories/K8sJobsRepository.cs:line 110
The statement causing the exception is itself not that important, except for tracing, but the exception may delay the release of a lock. This part of the code has been refactored/removed in the new v0.6.

Best Regards.

Good Morning,

we've tested v0.6: there is one positive and one negative finding.

Positive - In the first runs, the result was obtained in a couple of seconds instead of 27!
Negative - In the long run, the behavior is the same: as more images are submitted, each run takes longer, up to 8 minutes, and then we get the error ("Error: Make sure clara platform is up and running").

Also, another new problem is that the "clara describe job <JOB_ID>" command never works (even in the first runs) and returns the error "Error: Code: -1280, Unhandled application exception: UnknownJobIdentityException => job identity {00000000000000000000000000000000} does not match any known job."

We tried to get the logs, but they are not present ("kubectl logs" does not return anything).

Any idea? Thanks in advance for your response.

Best Regards

Thanks for the feedback. It is great to hear about the high performance of your setup.

I've asked internal engineering resources to help with the two issues, as they both appear to be in the Platform API.

Best Regards