Installation problems on Azure

To describe our current setup: we are using the following Azure VM, which appears to meet the requirements in your Clara Deploy SDK installation guide.
VM size: NV12_Promo
Family: GPU
vCPUs: 12
RAM (GiB): 112
Data disks: 48
Max IOPS: 48x500
Temp storage (GiB): 680
Premium disk: Not supported
(Image 02 in zip)

We have been able to create pipelines and jobs, but have not been able to get any output. We have also tried through your web interface, without success.
(Image 01 in zip)

The above job (using the sample chest X-ray PNG provided) has been running for one hour now without producing any output.

Here are the installation details:
(Image 03 in zip)

Additional info … (TOPS for reference)

  1. Just after installation completion
    (Image 04 in zip)

  2. After Clara Deploy Service Start
    (Image 05 in zip)

  3. After Clara start
    (Image 06 in zip)

  4. After Job Start
    (Image 07 in zip)

We need some direction on where we are going wrong; we have been struggling with this for a week now. Thanks in advance.

NvidiaClara.zip (418.6 KB)

Hi Gautam,

Thanks for your interest in Clara Deploy. It looks like the trtis pod is not starting correctly based on the output of kubectl get pods.

Can you share the output of the following commands?
kubectl describe pod clara-pipesvc-fd7219e9-trtis-678c467c6f-2pgpv
kubectl logs clara-pipesvc-fd7219e9-trtis-678c467c6f-2pgpv
The pod name may have changed if the pod was restarted; adjust it accordingly.
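
If copying the pod name by hand is error-prone, it can be looked up by a name fragment instead. This is only a sketch: find_pods is a hypothetical helper name, and it assumes the pod name still contains "trtis" and that kubectl get pods prints the pod name in its first column.

```shell
# find_pods FRAGMENT: read `kubectl get pods` output on stdin and print the
# names of all pods whose NAME column contains FRAGMENT (header line skipped).
find_pods() {
    awk -v frag="$1" 'NR > 1 && index($1, frag) > 0 { print $1 }'
}

# Usage (needs a working kubectl context):
#   for pod in $(kubectl get pods | find_pods trtis); do
#       kubectl describe pod "$pod"
#       kubectl logs "$pod"
#   done
```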

Can you also share your pipeline definition?

Thanks,
Kris

Hello kkersten, thanks for responding. Here is the data…

NvidiaClara2.zip (131.4 KB)

Hello kkersten, we hope our reply answers your questions. We look forward to your further input. Thanks.

Hi Gautam,

Sorry for the delay. In the pod description, there is an error “no space left on device” for /var/lib/docker/… when trying to pull the image. It’s possible that docker is running out of space or inodes on the filesystem.

Will you provide the output of the following?
df -h
df -ih /var/lib/docker
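
To spot a full filesystem quickly, the same df output can be filtered by usage. The sketch below is an illustration, not part of Clara: over_threshold is a hypothetical helper, the 90% limit is an arbitrary choice, and it relies on the POSIX output format (df -P), where the fifth column is the usage percentage. Mount points containing spaces are not handled.

```shell
# over_threshold PCT: read `df -P` (blocks) or `df -Pi` (inodes) output on
# stdin and print "mountpoint usage%" for each filesystem at or above PCT.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)              # strip the trailing percent sign
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Usage:
#   df -P  /var/lib/docker | over_threshold 90   # disk space
#   df -Pi /var/lib/docker | over_threshold 90   # inodes
```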

Thanks,
Kris

A couple more questions:
What OS are you running?
Can you provide the output from the Clara bootstrap.sh script?

This could be an issue in the flannel network config. The K8s pod sandbox error in the pod description points to a missing file, /run/flannel/subnet.env.
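
A quick way to confirm whether that file is present on the node is sketched below. check_subnet_env is a hypothetical helper name; the default path is the one from the pod description, and the follow-up grep assumes the flannel pods have "flannel" in their names.

```shell
# check_subnet_env [FILE]: print the flannel subnet file if it is readable,
# otherwise report it as missing on stderr and return non-zero.
check_subnet_env() {
    file="${1:-/run/flannel/subnet.env}"
    if [ -r "$file" ]; then
        cat "$file"
    else
        echo "missing: $file" >&2
        return 1
    fi
}

# Usage: if the file is missing, look at the state of the flannel pods.
#   check_subnet_env || kubectl get pods --all-namespaces | grep -i flannel
```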

Thank you again.

  1. The OS is Ubuntu 18.04.
  2. Unfortunately, the output of the Clara bootstrap.sh script was not captured during installation. If you need it, we can reinstall and provide the output.
  3. Based on your input, we have increased the disk space. The job was then created and did complete, but a single job took a staggering half hour, and again there was no output.
    Files attached for your reference.
    NvidiaClara3.zip (222.1 KB)

Please suggest how to proceed. We understand we may have to reinstall, since the disk was enlarged after the installation.

Hi Gautam,

You should be able to download the output payload by running the following:
clara list jobs
Then using the Job ID for the chestxray job,
clara download <Job ID>:/operators/ai-app-chestxray/*
If the operator ran successfully, you should have a .csv and a .png with the classification labels.
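
To check the download result without opening the folder, something like the following can be used. It is a sketch under assumptions: has_outputs is a hypothetical helper, and the directory that clara download writes to is not stated here, so the function takes it as an argument (defaulting to the current directory).

```shell
# has_outputs [DIR]: succeed only if DIR (searched recursively) contains at
# least one .csv file and at least one .png file.
has_outputs() {
    dir="${1:-.}"
    csv=$(find "$dir" -type f -name '*.csv' | head -n 1)
    png=$(find "$dir" -type f -name '*.png' | head -n 1)
    [ -n "$csv" ] && [ -n "$png" ]
}

# Usage:
#   has_outputs . && echo "operator output found" || echo "no output yet"
```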

If you don’t see output, we need to debug the pod. Look for the chestxray pod in the output of kubectl get pods. Using this pod name, list the logs for each of the containers in the pod:

kubectl logs <chestxray pod name> pod-manager
kubectl logs <chestxray pod name> ai-app-chestxray
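
If the container names differ in your deployment, they can be read from the pod spec rather than typed in. A minimal sketch, where pod_logs is a hypothetical helper name; it uses kubectl's jsonpath output to list the containers and the -c flag to select each one.

```shell
# pod_logs POD: print the logs of every container in POD, each preceded by a
# header line naming the pod and container.
pod_logs() {
    for c in $(kubectl get pod "$1" -o jsonpath='{.spec.containers[*].name}'); do
        echo "=== $1 / $c ==="
        kubectl logs "$1" -c "$c"
    done
}

# Usage:
#   pod_logs <chestxray pod name> > chestxray-logs.txt
```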

Regarding the slow runtime, which GPU does this instance have? Could you provide the output of nvidia-smi?

Can you also confirm that docker is configured correctly to use the GPU? You can test by running:
docker run --rm nvidia/cuda:latest nvidia-smi
You may need to add --gpus=all to the docker run depending on whether or not the nvidia container runtime is set as default.
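
Whether the flag is needed can be decided from Docker's reported default runtime. The sketch below assumes that a default runtime named "nvidia" means the NVIDIA container runtime is configured; gpu_flag is a hypothetical helper name.

```shell
# gpu_flag: print "--gpus=all" unless Docker's default runtime is "nvidia",
# in which case print nothing.
gpu_flag() {
    runtime=$(docker info --format '{{.DefaultRuntime}}' 2>/dev/null)
    if [ "$runtime" != "nvidia" ]; then
        echo "--gpus=all"
    fi
}

# Usage (the substitution is left unquoted so an empty result adds no argument):
#   docker run --rm $(gpu_flag) nvidia/cuda:latest nvidia-smi
```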

Also confirm that the GPUs are visible in kubernetes: kubectl describe node | grep gpu
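
For a more compact view than grepping the node description, the allocatable GPU count per node can be listed in a table and filtered. This is a sketch: nodes_without_gpu is a hypothetical helper, and it assumes the GPUs are advertised under the resource name nvidia.com/gpu, as the NVIDIA device plugin normally does.

```shell
# nodes_without_gpu: read a two-column "NODE GPU" table on stdin (header on
# the first line) and print the nodes that advertise no allocatable GPU.
nodes_without_gpu() {
    awk 'NR > 1 && ($2 == "<none>" || $2 + 0 == 0) { print $1 }'
}

# Usage:
#   kubectl get nodes \
#     -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | nodes_without_gpu
```

Empty output means every node reports at least one GPU.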

Thanks,
Kris