Error - Failed to Start Job -8451 - when running an image through the pipeline

I'm trying to run an image through the pipeline.
I get this error when I try to start the job, and I can't get to http://localhost:8000.

km:/etc/clara/pipelines/clara_ai_chestxray_pipeline$ clara start job -j 8def1bbbbe6e475a90efe753ac65539d
Unable to create job. Error:Code: -8451, Failed to start job.;Unhandled application exception: ValidationException => "'name' cannot be null.".

The pipeline YAML file details are below.

:/etc/clara/pipelines/clara_ai_chestxray_pipeline$ clara describe pipelines c302979bbc2c42aeacc249787a59fa6d
api-version: 0.3.0
name: chestxray-pipeline
pull-secrets:
  - ngc-clara
operators:
  - name: ai-app-chestxray
    description: Classifying Chest X-ray Images
    container:
      image: nvcr.io/nvidia/clara/ai-chestxray
      tag: 0.6.0-2006.4
    input:
    - path: /input
    output:
    - path: /output
    services:
    - name: trtis
      container:
        image: nvcr.io/nvidia/tensorrtserver
        tag: 19.08-py3
        command: ["trtserver", "--model-store=$(NVIDIA_CLARA_SERVICE_DATA_PATH)/models"]
      requests:
        gpu: 1
      connections:
        http:
        - name: NVIDIA_CLARA_TRTISURI
          port: 8000

Hello @portal,

Thank you for reporting the issue!
Could you let us know which version of Clara you are using?
Please share the output of clara version.
It would also help if you could find and share any other error messages from the api-server pod's log.

$ clara version
0.6.1-aaeed4e5.11796

$ kubectl get pods

NAME                                                  READY   STATUS      RESTARTS   AGE
clara-clara-platformapiserver-5bb4f5f697-f25xb        1/1     Running     0          17h
clara-console-5687765d8b-6xvjh                        2/2     Running     0          17h
clara-console-mongodb-85f8bd5f95-2tw6r                1/1     Running     0          17h
clara-pipesvc-402bab40-triton-7cd8646cbf-d27pr        1/1     Running     0          17h
clara-render-server-clara-renderer-58dc878d78-6jnrx   3/3     Running     0          17h
clara-resultsservice-f675d586d-dhgjk                  1/1     Running     0          17h
clara-ui-6f89b97df8-7q84p                             1/1     Running     0          17h
clara-workflow-controller-69cbb55fc8-w7qqc            1/1     Running     0          17h

$ kubectl logs clara-clara-platformapiserver-5bb4f5f697-f25xb --tail=100
(log messages for last 100 lines...)

Thank you!

kmaster@UbuntuImage:/etc/clara/pipelines/clara_ai_chestxray_pipeline$ clara version
0.6.0-11245
kmaster@UbuntuImage:/etc/clara/pipelines/clara_ai_chestxray_pipeline$  kubectl get pods
NAME                                                   READY   STATUS    RESTARTS   AGE
clara-clara-platformapiserver-74b9bcb88-cp2pj          1/1     Running   16         5d
clara-console-57cddb95c8-4vvgh                         2/2     Running   8          5d
clara-console-mongodb-85f8bd5f95-npbtx                 1/1     Running   4          5d
clara-dicom-adapter-f54bbfd97-kg8gc                    1/1     Running   2          3d21h
clara-monitor-server-fluentd-elasticsearch-86klm       1/1     Running   13         5d
clara-monitor-server-grafana-5f874b974d-wlw2k          1/1     Running   4          5d
clara-monitor-server-monitor-server-84446ccf85-cxzzc   1/1     Running   3          5d
clara-render-server-clara-renderer-794965cc7d-wcbwt    3/3     Running   9          5d
clara-resultsservice-6cfdb45846-7c4sk                  1/1     Running   3          5d
clara-ui-6f89b97df8-5zfph                              1/1     Running   3          5d
clara-workflow-controller-69cbb55fc8-xsdxt             1/1     Running   3          5d
elasticsearch-master-0                                 1/1     Running   3          5d
elasticsearch-master-1                                 1/1     Running   3          5d
fd7219e9-trtis-clara-pipesvc-57d6486fbf-nzsd6          1/1     Running   3          5d1h
kmaster@UbuntuImage:/etc/clara/pipelines/clara_ai_chestxray_pipeline$ kubectl logs clara-clara-platformapiserver-74b9bcb88-cp2pj
2020 Jul 08 19:09:04.214 Starting Clara Platform Server
2020 Jul 08 19:09:04.22   Host:                      0.0.0.0
2020 Jul 08 19:09:04.22   Port:                      50051
2020 Jul 08 19:09:04.22   Resolver:                  Clara
2020 Jul 08 19:09:04.22   Repositories:              K8s
2020 Jul 08 19:09:04.22   Storage:                   Disk (/clara/payloads)
2020 Jul 08 19:09:04.221   ExecutorSelector:          Clara
2020 Jul 08 19:09:04.221   Service Deployer:          K8s
2020 Jul 08 19:09:04.221   Inference Server Deployer: K8s
2020 Jul 08 19:09:04.221   Service Volume:            Disk (/clara/service-volumes)
2020 Jul 08 19:09:04.221   Inference Server Volume:   Disk (/clara/trtis)
2020 Jul 08 19:09:04.222   Common Volume:             K8s (clara-platformapiserver-common-volume-claim)
2020 Jul 08 19:09:04.224   Trace Listeners:           Console, Clara
2020 Jul 08 19:09:04.228 Added resource provider: resources = trtis.
2020 Jul 08 19:09:04.228 Added resource provider: resources = gpu.
2020 Jul 08 19:09:04.229 Scheduler started.
2020 Jul 08 19:09:04.478 Server (0.0.0.0:50051) started.
2020 Jul 08 19:09:04.479 Controller started
2020 Jul 08 19:09:04.48 QueueWorker in Job Controller started.
Press Ctrl-C to quit
2020 Jul 08 21:03:47.719 Job list request results: 0 records returned, 0 records filtered.
2020 Jul 08 21:04:40.885 Job list request results: 0 records returned, 0 records filtered.
2020 Jul 08 21:07:18.236 Job list request results: 0 records returned, 0 records filtered.

Can you please paste the output of the following commands?

kubectl get deploy
kubectl get psvc

Hi,

I was able to reproduce the error and debug the issue.
This happens when the Clara state (in this case, the records of pipeline services) gets corrupted by a broken install, uninstall, or upgrade.
Looking at the output of the commands you posted, I see that you have probably redeployed Clara.

In case you're still facing the issue,
please do the following to kill the trtis service:

kubectl get all | grep pipesvc | egrep 'deployment|service' | awk '{print $1}' | xargs kubectl delete
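If it helps, you can confirm that the stale pipeline-service objects are gone before retrying (a quick check, assuming the same pipesvc naming as above):

kubectl get all | grep pipesvc   # should return nothing once the deployment and service are deleted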

After this, please try running a job again. I'll be happy to look into any further issues.

Thanks, that made the job start. But when I go to localhost, the job has failed with the following error. I tried running this 4 times and got the same error each time.

OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: cuda error: unknown error\\n\""": unknown

The YAML looks like this:

container:
  env:
  - name: NVIDIA_CLARA_INPUT_MAP
    value: '/input:input::'
  - name: NVIDIA_CLARA_PIPELINE_ID
    value: 4038551f49b747dd8611e9e404a019d2
  - name: NVIDIA_CLARA_JOB_ID
    value: 79e255d0cd41476f98228535e117a8a5
  - name: NVIDIA_CLARA_PAYLOAD_ID
    value: a4710335abc6440b85fb71fcda0d5836
  - name: NVIDIA_CLARA_NOSYNCLOCK
    value: 'TRUE'
  - name: NVIDIA_CLARA_TRTISURI
    value: '10.103.214.18:8000'
  image: 'app_covidxray:latest'
  imagePullPolicy: IfNotPresent
  name: app-covidxray
  resources: {}
  volumeMounts:
  - mountPath: /output
    name: payload-volume
    subPath: operators/app-covidxray
  - mountPath: /input
    name: payload-volume
    readOnly: true
    subPath: input
inputs: {}
metadata:
  labels:
    job-id: 79e255d0-cd41-476f-9822-8535e117a8a5
    operator-name: app-covidxray
name: app-covidxray
outputs: {}
dag:
  tasks:
  - arguments: {}
    name: app-covidxray
    template: app-covidxray
inputs: {}
metadata: {}
name: clara-pipeline-entrypoint
outputs: {}

The pods look like they are working, but not all of them:

kmaster@UbuntuImage:/etc/clara/pipelines/clara_ai_chestxray_pipeline$ kubectl get pods
NAME                                                   READY   STATUS               RESTARTS   AGE
chestxray-test-tm7gc-2612269396                        0/2     ContainerCannotRun   0          44m
clara-clara-platformapiserver-74b9bcb88-cp2pj          1/1     Running              16         5d5h
clara-console-57cddb95c8-4vvgh                         2/2     Running              8          5d4h
clara-console-mongodb-85f8bd5f95-npbtx                 1/1     Running              4          5d4h
clara-dicom-adapter-f54bbfd97-kg8gc                    1/1     Running              2          4d1h
clara-monitor-server-fluentd-elasticsearch-86klm       1/1     Running              13         5d5h
clara-monitor-server-grafana-5f874b974d-wlw2k          1/1     Running              4          5d5h
clara-monitor-server-monitor-server-84446ccf85-cxzzc   1/1     Running              3          5d5h
clara-render-server-clara-renderer-794965cc7d-wcbwt    3/3     Running              9          5d5h
clara-resultsservice-6cfdb45846-7c4sk                  1/1     Running              3          5d5h
clara-ui-6f89b97df8-5zfph                              1/1     Running              3          5d5h
clara-workflow-controller-69cbb55fc8-xsdxt             1/1     Running              3          5d5h
covidxray-test-84kq5-898999818                         0/2     ContainerCannotRun   0          29m
covidxray-test-8dbc6-3275910894                        0/2     ContainerCannotRun   0          8m10s
covidxray-test-hwjlc-1448011773                        0/2     ContainerCannotRun   0          36m
elasticsearch-master-0                                 1/1     Running              3          5d5h
elasticsearch-master-1                                 1/1     Running              3          5d5h
fd7219e9-trtis-clara-pipesvc-989dbf996-rrj94           0/1     CrashLoopBackOff     6          5m20s

I re-ran your command to delete and again started a new job, but the container is failing and the jobs can't run.

kmaster@UbuntuImage:/etc/clara/pipelines/clara_ai_chestxray_pipeline$ kubectl get pods
NAME                                                   READY   STATUS               RESTARTS   AGE
chestxray-test-tm7gc-2612269396                        0/2     ContainerCannotRun   0          51m
clara-clara-platformapiserver-74b9bcb88-cp2pj          1/1     Running              16         5d5h
clara-console-57cddb95c8-4vvgh                         2/2     Running              8          5d5h
clara-console-mongodb-85f8bd5f95-npbtx                 1/1     Running              4          5d5h
clara-dicom-adapter-f54bbfd97-kg8gc                    1/1     Running              2          4d1h
clara-monitor-server-fluentd-elasticsearch-86klm       1/1     Running              13         5d5h
clara-monitor-server-grafana-5f874b974d-wlw2k          1/1     Running              4          5d5h
clara-monitor-server-monitor-server-84446ccf85-cxzzc   1/1     Running              3          5d5h
clara-render-server-clara-renderer-794965cc7d-wcbwt    3/3     Running              9          5d5h
clara-resultsservice-6cfdb45846-7c4sk                  1/1     Running              3          5d5h
clara-ui-6f89b97df8-5zfph                              1/1     Running              3          5d5h
clara-workflow-controller-69cbb55fc8-xsdxt             1/1     Running              3          5d5h
covidxray-test-84kq5-898999818                         0/2     ContainerCannotRun   0          36m
covidxray-test-8dbc6-3275910894                        0/2     ContainerCannotRun   0          15m
covidxray-test-8gmwt-1580145162                        0/2     ContainerCreating    0          2m
covidxray-test-hwjlc-1448011773                        0/2     ContainerCannotRun   0          43m
elasticsearch-master-0                                 1/1     Running              3          5d5h
elasticsearch-master-1                                 1/1     Running              3          5d5h
fd7219e9-trtis-clara-pipesvc-6756f48f89-h42mb          0/1     RunContainerError    0          2m2s

Hi @portal, could you share the description of the failing pod/service so we can see what caused the error?

kubectl describe pods/covidxray-test-8dbc6-3275910894
kubectl describe pods/fd7219e9-trtis-clara-pipesvc-6756f48f89-h42mb

Name: covidxray-test-8dbc6-3275910894
Namespace: default
Priority: 0
Node: ubuntuimage/192.168.1.218
Start Time: Wed, 08 Jul 2020 21:16:33 -0500
Labels: job-id=79e255d0-cd41-476f-9822-8535e117a8a5
operator-name=app-covidxray
workflows.argoproj.io/completed=true
workflows.argoproj.io/workflow=covidxray-test-8dbc6
Annotations: workflows.argoproj.io/execution: {"deadline":"2020-07-09T02:23:20Z"}
workflows.argoproj.io/node-name: covidxray-test-8dbc6.app-covidxray
workflows.argoproj.io/template:
{"name":"app-covidxray","inputs":{},"outputs":{},"metadata":{"labels":{"job-id":"79e255d0-cd41-476f-9822-8535e117a8a5","operator-name":"ap…
Status: Failed
IP: 10.244.0.231
Controlled By: Workflow/covidxray-test-8dbc6
Containers:
main:
Container ID: docker://a63ee0c2f097f17130c7c271a951aa48085de4152ca9857c5813a0a87360cb18
Image: app_covidxray:latest
Image ID: docker://sha256:f811e08e2a9de8246f7d7c58f28a050d441f9799b0881453e8307798c89dfb73
Port:
Host Port:
State: Terminated
Reason: ContainerCannotRun
Message: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: cuda error: unknown error\\n\""": unknown
Exit Code: 128
Started: Wed, 08 Jul 2020 21:16:53 -0500
Finished: Wed, 08 Jul 2020 21:16:53 -0500
Ready: False
Restart Count: 0
Environment:
NVIDIA_CLARA_INPUT_MAP: /input:input::
NVIDIA_CLARA_PIPELINE_ID: 4038551f49b747dd8611e9e404a019d2
NVIDIA_CLARA_JOB_ID: 79e255d0cd41476f98228535e117a8a5
NVIDIA_CLARA_PAYLOAD_ID: a4710335abc6440b85fb71fcda0d5836
NVIDIA_CLARA_NOSYNCLOCK: TRUE
NVIDIA_CLARA_TRTISURI: 10.103.214.18:8000
Mounts:
/input from payload-volume (ro,path="input")
/output from payload-volume (rw,path="operators/app-covidxray")
/var/run/secrets/kubernetes.io/serviceaccount from default-token-2kchz (ro)
wait:
Container ID: docker://3e87b0df77209d41a490e4ea22814d29e8a6f9393fa6f1bbe486f97e2a26c269
Image: argoproj/argoexec:v2.2.1
Image ID: docker-pullable://argoproj/argoexec@sha256:9b12553aa7dccddc88c766d3dd59f4e8758acbd82ceef9e7aedc75f09934480a
Port:
Host Port:
Command:
argoexec
Args:
wait
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 08 Jul 2020 21:17:39 -0500
Finished: Wed, 08 Jul 2020 21:17:39 -0500
Ready: False
Restart Count: 0
Environment:
ARGO_POD_NAME: covidxray-test-8dbc6-3275910894 (v1:metadata.name)
Mounts:
/argo/podmetadata from podmetadata (rw)
/var/lib/docker from docker-lib (ro)
/var/run/docker.sock from docker-sock (ro)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-2kchz (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
podmetadata:
Type: DownwardAPI (a volume populated by information about the pod)
Items:
metadata.annotations -> annotations
docker-lib:
Type: HostPath (bare host directory volume)
Path: /var/lib/docker
HostPathType: Directory
docker-sock:
Type: HostPath (bare host directory volume)
Path: /var/run/docker.sock
HostPathType: Socket
payload-volume:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: a4710335-abc6-440b-85fb-71fcda0d5836-claim
ReadOnly: false
default-token-2kchz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-2kchz
Optional: false
QoS Class: BestEffort
Node-Selectors:
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:

kmaster@UbuntuImage:/etc/clara/pipelines/clara_ai_chestxray_pipeline$ kubectl describe pods/fd7219e9-trtis-clara-pipesvc-6756f48f89-h42mb
Name: fd7219e9-trtis-clara-pipesvc-6756f48f89-h42mb
Namespace: default
Priority: 0
Node: ubuntuimage/192.168.1.218
Start Time: Wed, 08 Jul 2020 21:26:20 -0500
Labels: app=fd7219e9-88f2-435b-b953-3379b420b307
pod-template-hash=6756f48f89
Annotations:
Status: Running
IP: 10.244.0.232
Controlled By: ReplicaSet/fd7219e9-trtis-clara-pipesvc-6756f48f89
Containers:
fd7219e9-88f2-435b-b953-3379b420b307:
Container ID: docker://4c20cd4844ce7cfddd2cac133b12076d0481a7181256b7d4057a40bea90412b8
Image: nvcr.io/nvidia/tensorrtserver:19.08-py3
Image ID: docker-pullable://nvcr.io/nvidia/tensorrtserver@sha256:438b6c2ddfd095faf3453f348c8639ea5be0c28a687a604d6f691f07469076c6
Port: 8000/TCP
Host Port: 0/TCP
Command:
trtserver
--model-store=$(NVIDIA_CLARA_SERVICE_DATA_PATH)/models
State: Waiting
Reason: RunContainerError
Last State: Terminated
Exit Code: 0
Started: Mon, 01 Jan 0001 00:00:00 +0000
Finished: Mon, 01 Jan 0001 00:00:00 +0000
Ready: False
Restart Count: 176
Environment:
NVIDIA_CLARA_SERVICE_DATA_PATH: /732c84da-2d56-4a3e-b4a4-4ea64276469a
Mounts:
/732c84da-2d56-4a3e-b4a4-4ea64276469a from servicecommonvolume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-2kchz (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
servicecommonvolume:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: clara-platformapiserver-common-volume-claim
ReadOnly: false
default-token-2kchz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-2kchz
Optional: false
QoS Class: BestEffort
Node-Selectors:
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message


Normal Pulled 16m (x174 over 11h) kubelet, ubuntuimage Container image "nvcr.io/nvidia/tensorrtserver:19.08-py3" already present on machine
Warning BackOff 63s (x2918 over 11h) kubelet, ubuntuimage Back-off restarting failed container

Hi @portal,
Thanks for sharing the information!

Seeing the error

OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: cuda error: unknown error\n\""": unknown

I think your NVIDIA GPU/nvidia-container setup is messed up:

If you didn't reboot after installing the GPU driver/CUDA toolkit, that can cause this issue. Please reboot the system and execute the pipeline again.

You can also check whether the prerequisites are properly installed (please share the output of these):

# 1) local check
nvidia-smi
# 2) check inside docker
docker run --runtime=nvidia --rm nvidia/cuda:10.0-base nvidia-smi
# 3) nvidia-container-cli
nvidia-container-cli -k -d /dev/tty info
# 4) docker daemon setup
cat /etc/docker/daemon.json
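As an additional, optional check (an assumption on our side, in case the container toolkit packages themselves are at fault), you could also list the installed NVIDIA container packages:

# 5) (optional) installed NVIDIA container runtime packages
dpkg -l | grep -E 'nvidia-container|nvidia-docker'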

Thanks!

kmaster@UbuntuImage:~$ nvidia-smi
Thu Jul  9 15:35:15 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   60C    P3    19W /  N/A |    439MiB /  6078MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3869      G   /usr/lib/xorg/Xorg                           271MiB |
|    0      4189      G   /usr/bin/gnome-shell                         104MiB |
|    0      8811      G   ...AAAAAAAAAAAACAAAAAAAAAA= --shared-files    59MiB |
+-----------------------------------------------------------------------------+
kmaster@UbuntuImage:~$ sudo docker run --runtime=nvidia --rm nvidia/cuda:10.0-base nvidia-smi
Thu Jul  9 20:36:13 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   59C    P5     8W /  N/A |    457MiB /  6078MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
kmaster@UbuntuImage:~$ nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I0709 20:36:32.475021 15049 nvc.c:281] initializing library context (version=1.1.1, build=e5d6156aba457559979597c8e3d22c5d8d0622db)
I0709 20:36:32.475054 15049 nvc.c:255] using root /
I0709 20:36:32.475062 15049 nvc.c:256] using ldcache /etc/ld.so.cache
I0709 20:36:32.475068 15049 nvc.c:257] using unprivileged user 1000:1000
W0709 20:36:32.477319 15052 nvc.c:186] failed to set inheritable capabilities
W0709 20:36:32.477345 15052 nvc.c:187] skipping kernel modules load due to failure
I0709 20:36:32.477447 15053 driver.c:101] starting driver service
I0709 20:36:32.478871 15049 nvc_info.c:541] requesting driver information with ''
I0709 20:36:32.479882 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.440.100
I0709 20:36:32.479958 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.440.100
I0709 20:36:32.480005 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.440.100
I0709 20:36:32.480070 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.440.100
I0709 20:36:32.480160 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.440.100
I0709 20:36:32.480232 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.440.100
I0709 20:36:32.480285 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.440.100
I0709 20:36:32.480339 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.440.100
I0709 20:36:32.480383 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.440.100
I0709 20:36:32.480413 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.440.100
I0709 20:36:32.480441 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.440.100
I0709 20:36:32.480470 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.440.100
I0709 20:36:32.480512 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.440.100
I0709 20:36:32.480541 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.440.100
I0709 20:36:32.480583 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.440.100
I0709 20:36:32.480611 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.440.100
I0709 20:36:32.480639 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.440.100
I0709 20:36:32.480680 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.440.100
I0709 20:36:32.480710 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.440.100
I0709 20:36:32.480751 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.440.100
I0709 20:36:32.480909 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.440.100
I0709 20:36:32.480995 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.440.100
I0709 20:36:32.481026 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.440.100
I0709 20:36:32.481056 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.440.100
I0709 20:36:32.481088 15049 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.440.100
I0709 20:36:32.481128 15049 nvc_info.c:155] selecting /usr/lib/i386-linux-gnu/libnvidia-tls.so.440.100
I0709 20:36:32.481158 15049 nvc_info.c:155] selecting /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.440.100
I0709 20:36:32.481202 15049 nvc_info.c:155] selecting /usr/lib/i386-linux-gnu/libnvidia-opticalflow.so.440.100
I0709 20:36:32.481243 15049 nvc_info.c:155] selecting /usr/lib/i386-linux-gnu/libnvidia-opencl.so.440.100
I0709 20:36:32.481272 15049 nvc_info.c:155] selecting /usr/lib/i386-linux-gnu/libnvidia-ml.so.440.100
I0709 20:36:32.481317 15049 nvc_info.c:155] selecting /usr/lib/i386-linux-gnu/libnvidia-ifr.so.440.100
I0709 20:36:32.481361 15049 nvc_info.c:155] selecting /usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.440.100
I0709 20:36:32.481391 15049 nvc_info.c:155] selecting /usr/lib/i386-linux-gnu/libnvidia-glsi.so.440.100
I0709 20:36:32.481421 15049 nvc_info.c:155] selecting /usr/lib/i386-linux-gnu/libnvidia-glcore.so.440.100
I0709 20:36:32.481451 15049 nvc_info.c:155] selecting /usr/lib/i386-linux-gnu/libnvidia-fbc.so.440.100
I0709 20:36:32.481493 15049 nvc_info.c:155] selecting /usr/lib/i386-linux-gnu/libnvidia-fatbinaryloader.so.440.100
I0709 20:36:32.481524 15049 nvc_info.c:155] selecting /usr/lib/i386-linux-gnu/libnvidia-encode.so.440.100
I0709 20:36:32.481565 15049 nvc_info.c:155] selecting /usr/lib/i386-linux-gnu/libnvidia-eglcore.so.440.100
I0709 20:36:32.481594 15049 nvc_info.c:155] selecting /usr/lib/i386-linux-gnu/libnvidia-compiler.so.440.100
I0709 20:36:32.481626 15049 nvc_info.c:155] selecting /usr/lib/i386-linux-gnu/libnvcuvid.so.440.100
I0709 20:36:32.481676 15049 nvc_info.c:155] selecting /usr/lib/i386-linux-gnu/libcuda.so.440.100
I0709 20:36:32.481725 15049 nvc_info.c:155] selecting /usr/lib/i386-linux-gnu/libGLX_nvidia.so.440.100
I0709 20:36:32.481757 15049 nvc_info.c:155] selecting /usr/lib/i386-linux-gnu/libGLESv2_nvidia.so.440.100
I0709 20:36:32.481787 15049 nvc_info.c:155] selecting /usr/lib/i386-linux-gnu/libGLESv1_CM_nvidia.so.440.100
I0709 20:36:32.481817 15049 nvc_info.c:155] selecting /usr/lib/i386-linux-gnu/libEGL_nvidia.so.440.100
W0709 20:36:32.481836 15049 nvc_info.c:306] missing library libvdpau_nvidia.so
W0709 20:36:32.481841 15049 nvc_info.c:310] missing compat32 library libnvidia-cfg.so
W0709 20:36:32.481847 15049 nvc_info.c:310] missing compat32 library libnvidia-allocator.so
W0709 20:36:32.481853 15049 nvc_info.c:310] missing compat32 library libvdpau_nvidia.so
W0709 20:36:32.481858 15049 nvc_info.c:310] missing compat32 library libnvidia-rtcore.so
W0709 20:36:32.481863 15049 nvc_info.c:310] missing compat32 library libnvoptix.so
W0709 20:36:32.481869 15049 nvc_info.c:310] missing compat32 library libnvidia-cbl.so
I0709 20:36:32.482135 15049 nvc_info.c:236] selecting /usr/bin/nvidia-smi
I0709 20:36:32.482152 15049 nvc_info.c:236] selecting /usr/bin/nvidia-debugdump
I0709 20:36:32.482169 15049 nvc_info.c:236] selecting /usr/bin/nvidia-persistenced
I0709 20:36:32.482184 15049 nvc_info.c:236] selecting /usr/bin/nvidia-cuda-mps-control
I0709 20:36:32.482200 15049 nvc_info.c:236] selecting /usr/bin/nvidia-cuda-mps-server
I0709 20:36:32.482221 15049 nvc_info.c:373] listing device /dev/nvidiactl
I0709 20:36:32.482225 15049 nvc_info.c:373] listing device /dev/nvidia-uvm
I0709 20:36:32.482228 15049 nvc_info.c:373] listing device /dev/nvidia-uvm-tools
I0709 20:36:32.482231 15049 nvc_info.c:373] listing device /dev/nvidia-modeset
I0709 20:36:32.482252 15049 nvc_info.c:277] listing ipc /run/nvidia-persistenced/socket
W0709 20:36:32.482263 15049 nvc_info.c:281] missing ipc /tmp/nvidia-mps
I0709 20:36:32.482268 15049 nvc_info.c:598] requesting device information with ''
I0709 20:36:32.488169 15049 nvc_info.c:637] listing device /dev/nvidia0 (GPU-0f002fcf-4ec4-b87d-2388-2fb255ab68f4 at 00000000:01:00.0)
NVRM version:   440.100
CUDA version:   10.2

Device Index:   0
Device Minor:   0
Model:          GeForce GTX 1060 with Max-Q Design
Brand:          GeForce
GPU UUID:       GPU-0f002fcf-4ec4-b87d-2388-2fb255ab68f4
Bus Location:   00000000:01:00.0
Architecture:   6.1
I0709 20:36:32.488194 15049 nvc.c:318] shutting down library context
I0709 20:36:32.488424 15053 driver.c:156] terminating driver service
I0709 20:36:32.488654 15049 driver.c:196] driver service terminated successfully
kmaster@UbuntuImage:~$ docker daemon setup
WARNING: Error loading config file: /home/kmaster/.docker/config.json: open /home/kmaster/.docker/config.json: permission denied
docker: 'daemon' is not a docker command.
See 'docker --help'
kmaster@UbuntuImage:~$ cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia",
    "dns": ["192.168.1.254","8.8.4.4", "8.8.8.8"]
}
kmaster@UbuntuImage:~$ docker version
WARNING: Error loading config file: /home/kmaster/.docker/config.json: open /home/kmaster/.docker/config.json: permission denied
Client: Docker Engine - Community
 Version:           19.03.12
 API version:       1.40
 Go version:        go1.13.10
 Git commit:        48a66213fe
 Built:             Mon Jun 22 15:45:36 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.12
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.10
  Git commit:       48a66213fe
  Built:            Mon Jun 22 15:44:07 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 nvidia:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Trying to restart the pods for Clara:

kmaster@UbuntuImage:/etc/clara/bootstrap$ kubectl get pods
NAME                                                   READY   STATUS    RESTARTS   AGE
clara-clara-platformapiserver-74b9bcb88-8sbdj          0/1     Pending   0          8m46s
clara-console-57cddb95c8-4vvgh                         0/2     Error     8          5d23h
clara-console-mongodb-85f8bd5f95-npbtx                 0/1     Error     4          5d23h
clara-dicom-adapter-f54bbfd97-kg8gc                    0/1     Error     2          4d20h
clara-monitor-server-fluentd-elasticsearch-86klm       0/1     Error     13         5d23h
clara-monitor-server-grafana-5f874b974d-wlw2k          0/1     Error     4          5d23h
clara-monitor-server-monitor-server-84446ccf85-cxzzc   0/1     Error     3          5d23h
clara-render-server-clara-renderer-794965cc7d-wcbwt    0/3     Error     9          5d23h
clara-resultsservice-6cfdb45846-4gzqb                  0/1     Error     0          3h28m
clara-ui-6f89b97df8-rlrj9                              0/1     Error     0          3h28m
clara-workflow-controller-69cbb55fc8-2qk8l             0/1     Error     0          3h28m
elasticsearch-master-0                                 0/1     Error     3          5d23h
elasticsearch-master-1                                 0/1     Error     3          5d23h
kmaster@UbuntuImage:/etc/clara/bootstrap$ clara dicom start
Error: could not find a ready tiller pod
Usage:
  dicom start [flags]

Flags:
  -f, --file string   Custom configuration file
  -h, --help          help for start

Global Flags:
      --config string   config file (default is $HOME/.clara/config.yaml)

could not find a ready tiller pod
kmaster@UbuntuImage:/etc/clara/bootstrap$ clara render start
Error: could not find a ready tiller pod
Usage:
  render start [flags]

Flags:
  -h, --help   help for start

Global Flags:
      --config string   config file (default is $HOME/.clara/config.yaml)
  -v, --verbose         verbose output

Error: could not find a ready tiller pod
kmaster@UbuntuImage:/etc/clara/bootstrap$ helm ls
Error: could not find a ready tiller pod
kmaster@UbuntuImage:/etc/clara/bootstrap$ sudo ./bootstrap.sh 
2020-07-09 20:53:38 [INFO]: Clara Deploy SDK System Prerequisites Installation
2020-07-09 20:53:38 [INFO]: Checking user privilege...
 
2020-07-09 20:53:38 [INFO]: Checking for NVIDIA GPU driver...
2020-07-09 20:53:38 [INFO]: NVIDIA CUDA driver version found: 440.100
2020-07-09 20:53:38 [INFO]: NVIDIA GPU driver found
2020-07-09 20:53:38 [INFO]: Check and install required packages: apt-transport-https ca-certificates curl software-properties-common network-manager unzip lsb-release dirmngr ...
Get:1 https://nvidia.github.io/libnvidia-container/ubuntu18.04/amd64  InRelease [1,146 B]
Hit:2 https://download.docker.com/linux/ubuntu bionic InRelease                                                                                                                                            
Hit:3 https://nvidia.github.io/nvidia-container-runtime/ubuntu18.04/amd64  InRelease                                                                                                                       
Hit:4 http://us.archive.ubuntu.com/ubuntu bionic InRelease                                                                                                     
Hit:5 http://dl.google.com/linux/chrome/deb stable InRelease                                                 
Hit:6 https://nvidia.github.io/nvidia-docker/ubuntu18.04/amd64  InRelease                                    
Hit:7 http://us.archive.ubuntu.com/ubuntu bionic-updates InRelease                     
Hit:8 http://us.archive.ubuntu.com/ubuntu bionic-backports InRelease                   
Hit:9 http://security.ubuntu.com/ubuntu bionic-security InRelease        
Hit:10 https://packages.cloud.google.com/apt kubernetes-xenial InRelease                         
Reading package lists... Done                                                                    
E: Repository 'https://nvidia.github.io/libnvidia-container/ubuntu18.04/amd64  InRelease' changed its 'Origin' value from 'https://nvidia.github.io/libnvidia-container' to 'https://nvidia.github.io/libnvidia-container/stable'
N: This must be accepted explicitly before updates for this repository can be applied. See apt-secure(8) manpage for details.

I re-pulled Clara and am trying to restart it, but it looks like Tiller is missing, so I'm running the bootstrap. But that is asking me to go online and give explicit acceptance.

Hi @portal
Thank you for sharing the information!

The CUDA/GPU driver installation status looks normal; I don't have a definite solution for that error yet.

It also looks like a dependent package was recently upgraded and now requires explicit acceptance.
To fix the issue, could you please execute sudo apt update, type y to the question, and then reinstall Clara (sudo ./uninstall-prereqs.sh and sudo ./bootstrap.sh)? See the sketch below.
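A minimal sketch of that sequence (assuming the /etc/clara/bootstrap path shown earlier in this thread; adjust to your install location):

sudo apt update                # answer 'y' when asked to accept the changed repository origin
cd /etc/clara/bootstrap        # or wherever the Clara prerequisite scripts live on your system
sudo ./uninstall-prereqs.sh
sudo ./bootstrap.sh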

Also, some models such as the liver tumor and COVID models may require more than 6 GB of GPU memory, so please try executing a smaller model such as breast cancer or chest X-ray and let us know if the issue still happens.

Thank you!

Thanks, that was fixed once I had the correct NVIDIA toolkit installed, and I re-ran all the jobs.

But now, when I connect to the Orthanc server to test the PACS pipeline, the job fails.

Below are the error details:

W0709 21:20:13.834729 main.cpp:1297] Orthanc version: 1.3.1
W0709 21:20:13.834796 main.cpp:1145] Performance warning: Non-release build, runtime debug assertions are turned on
W0709 21:20:13.836010 OrthancInitialization.cpp:162] Scanning folder "/etc/orthanc/" for configuration files
W0709 21:20:13.836063 OrthancInitialization.cpp:114] Reading the configuration from: "/etc/orthanc/orthanc.json"
W0709 21:20:13.836325 OrthancInitialization.cpp:114] Reading the configuration from: "/etc/orthanc/worklists.json"
W0709 21:20:13.836363 OrthancInitialization.cpp:114] Reading the configuration from: "/etc/orthanc/serve-folders.json"
W0709 21:20:13.850748 FromDcmtkBridge.cpp:149] Loading the external DICOM dictionary "/usr/share/libdcmtk12/dicom.dic"
W0709 21:20:13.858436 FromDcmtkBridge.cpp:149] Loading the external DICOM dictionary "/usr/share/libdcmtk12/private.dic"
W0709 21:20:13.862950 FromDcmtkBridge.cpp:2075] Registering JPEG Lossless codecs in DCMTK
W0709 21:20:13.862969 FromDcmtkBridge.cpp:2080] Registering JPEG codecs in DCMTK
W0709 21:20:13.868667 main.cpp:670] Loading plugin(s) from: /usr/share/orthanc/plugins/
W0709 21:20:13.868857 PluginsManager.cpp:269] Registering plugin 'worklists' (version 1.3.1)
W0709 21:20:13.868866 PluginsManager.cpp:168] Sample worklist plugin is initializing
W0709 21:20:13.868945 PluginsManager.cpp:168] Worklist server is disabled by the configuration file
W0709 21:20:13.869065 PluginsManager.cpp:269] Registering plugin 'serve-folders' (version 1.3.1)
W0709 21:20:13.869189 PluginsManager.cpp:168] ServeFolders: Empty configuration file: No additional folder will be served!
W0709 21:20:13.869202 OrthancInitialization.cpp:998] SQLite index directory: "/var/lib/orthanc/db-v6"
W0709 21:20:13.869288 OrthancInitialization.cpp:1068] Storage directory: "/var/lib/orthanc/db-v6"
W0709 21:20:13.869659 HttpClient.cpp:686] HTTPS will use the CA certificates from this file: /etc/orthanc/
W0709 21:20:13.869823 ServerScheduler.cpp:135] The server scheduler has started
W0709 21:20:13.869969 LuaContext.cpp:103] Lua says: Lua toolbox installed
W0709 21:20:13.870033 ServerContext.cpp:182] Disk compression is disabled
W0709 21:20:13.870043 ServerIndex.cpp:1403] No limit on the number of stored patients
W0709 21:20:13.870058 ServerIndex.cpp:1420] No limit on the size of the storage area
W0709 21:20:13.870364 main.cpp:862] DICOM server listening with AET ORTHANC on port: 4242
W0709 21:20:13.870389 MongooseServer.cpp:1075] HTTP compression is enabled
W0709 21:20:13.871299 main.cpp:795] HTTP server listening on port: 8042
W0709 21:20:13.871318 main.cpp:682] Orthanc has started
E0709 21:20:29.707397 DicomUserConnection.cpp:167] DicomUserConnection: Failed to establish association
0006:0317 Peer aborted Association (or never connected)
0006:031c TCP Initialization Error: Connection refused
E0709 21:20:29.707731 ServerScheduler.cpp:123] Job has failed (HTTP request: Store-SCU to peer "covid")

Most likely your sender's association request was rejected by the Clara DICOM Adapter. The sender's AE title needs to be registered with the Clara DICOM Adapter; otherwise the association request is rejected by default.
Please refer to the latest guide.
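For illustration only, a registered source entry in dicom-server-config.yaml looks roughly like the sketch below (field names follow the config format used elsewhere in this thread; the host IP and AE title are placeholders and must match your sender exactly):

dicom:
  scp:
    reject-unknown-sources: true
    sources:
    - host-ip: 192.168.1.218   # IP address of the sending Orthanc host (example)
      ae-title: ORTHANC        # calling AE title used by the sender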

Best Regards.

This is my dicom-server-config.yaml file after the changes, but the DICOM Adapter pod doesn't restart once I update the DICOM YAML.

/home/kmaster/.clara/charts/dicom-adapter/files/dicom-server-config.yaml

dicom:
  scp:
    port: 104
    ae-titles:
    - ae-title: covid
    - ae-title: OrganSeg
    - ae-title: LiverSeg
    - ae-title: BrainSeg
    - ae-title: HippocampusSeg
    - ae-title: SpleenSeg
    - ae-title: LungSeg
    - ae-title: ColonSeg
    - ae-title: PancreasSeg
    max-associations: 2
    verification:
      enabled: true
      transfer-syntaxes:
      - “1.2.840.10008.1.2.4.50” #JPEG Baseline
      - “1.2.840.10008.1.2” #Implicit VR Little Endian
      - “1.2.840.10008.1.2.1” #Explicit VR Little Endian
      - “1.2.840.10008.1.2.2” #Explicit VR Big Endian
    log-dimse-datasets: true
    reject-unknown-sources: true
    sources:
    - host-ip: 192.168.1.218
      ae-title: ORTHANC
    read-aetitles-from-crd: true
    read-sources-from-crd: true
  scu:
    ae-title: ClaraSCU
    max-associations: 2
    destinations:
    - name: MYPACS
      host-ip: 192.168.1.218
      port: 104
      ae-title: ORTHANC
    read-destinations-from-crd: true

pipeline-mappings:
  - name: covid-cls
    clara-ae-title: covid
    pipeline-id: def6895b89194491a8c7bfaab84a8f3d
  - name: organ-seg
    clara-ae-title: OrganSeg
    pipeline-id: 1db65f99c9b74329ab9cd519e0557638
  - name: liver-seg
    clara-ae-title: LiverSeg
    pipeline-id: fd3ee8bfb9f34808bd60243f870ff9bd
  - name: brain-seg
    clara-ae-title: BrainSeg
    pipeline-id: 059a899b25fa460a831b9e1ed0c20f80
  - name: hippocampus-seg
    clara-ae-title: HippocampusSeg
    pipeline-id: 5ae39488a8fe4c5aba210bb9e96c1adb
  - name: spleen-seg
    clara-ae-title: SpleenSeg
    pipeline-id: 0ec7700eefe14bada7dba9402743b19d
  - name: lung-seg
    clara-ae-title: LungSeg
    pipeline-id: 6bf927058d4940e8adb8be4f155f5be5
  - name: colon-seg
    clara-ae-title: ColonSeg
    pipeline-id: b612d147ca6c48e08d145c30ed90de8d
  - name: pancreas-seg
    clara-ae-title: PancreasSeg
    pipeline-id: 2b4dc54a0a56409c80b984ef0b2be9c2

Please run kubectl logs [name-of-dicom-adapter-pod]; the log should give you the detailed reason why it failed to start. Please feel free to provide the log here so we can further assist you. Thanks!
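For example (the pod name will differ on your system):

kubectl get pods | grep dicom-adapter          # find the DICOM Adapter pod name
kubectl logs <name-of-dicom-adapter-pod>       # then fetch its log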

kmaster@UbuntuImage:~$ kubectl logs clara-dicom-adapter-f54bbfd97-hcpsd
Reading logging.config from /opt/nvidia/clara/
2020-07-10 22:27:46.625 +00:00 [INFO] [clara-dicom-adapter-f54bbfd97-hcpsd] Nvidia.Clara.Dicom.Program[1] {} Initialize application with /opt/nvidia/clara/app.yaml
2020-07-10 22:27:46.661 +00:00 [INFO] [clara-dicom-adapter-f54bbfd97-hcpsd] Nvidia.Clara.Dicom.Program[1] {} Platform API endpoint set to 10.99.28.91:50051
2020-07-10 22:27:46.661 +00:00 [INFO] [clara-dicom-adapter-f54bbfd97-hcpsd] Nvidia.Clara.Dicom.Program[1] {} Results Service API endpoint set to http://10.110.240.248:8088
2020-07-10 22:27:46.664 +00:00 [EROR] [clara-dicom-adapter-f54bbfd97-hcpsd] Nvidia.Clara.Dicom.Configuration.ConfigurationValidator[1] {} Invalid Transfer Syntax UID found in dicom>scp>verification>transfer-syntaxes.
2020-07-10 22:27:46.665 +00:00 [FATL] [clara-dicom-adapter-f54bbfd97-hcpsd] Nvidia.Clara.Dicom.Program[1] {} Invalid DICOM configuration.

My YAML file has the transfer-syntaxes as:

  - “1.2.840.10008.1.2.4.50” #JPEG Baseline