Clara Insufficient resources

Hi, hope you are having a good day!
I have a new question and hope someone can help me fix it.

I am triggering a pipeline with local COVID input files, but I get:

$ clara create jobs -n chestxray-test -p 9e9ae449b0024b5598a15e30889c293a  -f input/png/
Unable to create job. Error:Code: -8454, Insufficient resources, can not launch job.

Following this post, I made a few attempts:

  1. Specified the GPU in the pipeline definition, but it doesn't work for me. The changed pipeline is as follows:
api-version: 0.4.0
name: chestxray-pipeline
pull-secrets:
- ngc-clara
operators:
- name: ai-app-chestxray
  description: Classifying Chest X-ray Images
  container:
    image: nvcr.io/nvidia/clara/ai-chestxray
    tag: 0.7.1-2008.1
  requests:
    gpu: 1
  input:
  - path: /input
  output:
  - path: /output
  services:
  - name: trtis
    container:
      image: nvcr.io/nvidia/tensorrtserver
      tag: 19.08-py3
      command: ["trtserver", "--model-store=~/.clara/pipelines/clara_ai_chestxray_pipeline/models"]
    connections:
      http:
      - name: NVIDIA_CLARA_TRTISURI
        port: 8000
    requests:
      gpu: 1
  2. Changed availableGpus in ~/.clara/charts/clara/values.yaml to availableGpus=-1. However, availableGpus is already set to -1 by default in the version I am using (see the quick check below).
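
For reference, a quick way to confirm the current value (just a grep; assuming the default chart location from the Clara install):

    $ grep -n availableGpus ~/.clara/charts/clara/values.yaml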

Additional information:

  1. clara version

    $ clara version
    Clara CLI version: 0.7.1-12788.ae65aea0
    Clara Platform version: 0.7.1-12788.ae65aea0
    
  2. kubectl describe nodes

    $ kubectl describe nodes   # to see capacity/allocatable resources like below
    Capacity:
      cpu:                88
      ephemeral-storage:  4522682Mi
      hugepages-1Gi:      0
      hugepages-2Mi:      0
      memory:             396157480Ki
      pods:               110
    Allocatable:
      cpu:                88
      ephemeral-storage:  4522682Mi
      hugepages-1Gi:      0
      hugepages-2Mi:      0
      memory:             396157480Ki
      pods:               110
    
  3. kubectl get all | grep pipesvc

    $ kubectl get all | grep pipesvc
    
    (empty)
    
    
  4. nvidia-smi

$ nvidia-smi
Mon Nov  2 18:33:58 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.95.01    Driver Version: 440.95.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 27%   23C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 20%   25C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 20%   30C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 20%   28C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Any idea what might be going wrong?

Thanks,
Cynthia

Hello @MengYun,

When you execute kubectl describe nodes,
an nvidia.com/gpu: entry is expected to be shown, like below:

Capacity:
 cpu:                12
 ephemeral-storage:  959200352Ki
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             65537436Ki
 nvidia.com/gpu:     1
 pods:               110
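
A quick way to check only the GPU entry (this just filters the same kubectl describe nodes output):

$ kubectl describe nodes | grep -i "nvidia.com/gpu"

If nothing is printed, Kubernetes does not see any GPUs on the node, which matches the error you are getting.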

Could you check the following info?

1) The installation of nvidia-docker (link)

$ docker info

...
 Runtimes: nvidia runc
 Default Runtime: nvidia
...
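
If Default Runtime is not nvidia, the usual fix is to set it in /etc/docker/daemon.json and restart Docker. This is only a sketch, assuming nvidia-docker2/nvidia-container-runtime is already installed at the default path:

$ cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
$ sudo systemctl restart docker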

2) The installation of the NVIDIA GPU device plugin (nvidia-device-plugin) (link)

$ kubectl get pod -n kube-system
NAME                                      READY   STATUS    RESTARTS   AGE
coredns-5c98db65d4-4szvd                  1/1     Running   18         27d
coredns-5c98db65d4-t6cxk                  1/1     Running   17         27d
etcd-gbae.nvidia.com                      1/1     Running   10         27d
kube-apiserver-gbae.nvidia.com            1/1     Running   12         27d
kube-controller-manager-gbae.nvidia.com   1/1     Running   17         27d
kube-flannel-ds-amd64-nk8t6               1/1     Running   12         27d
kube-proxy-lbwpw                          1/1     Running   9          27d
kube-scheduler-gbae.nvidia.com            1/1     Running   17         27d
...
nvidia-device-plugin-daemonset-2md74      1/1     Running   10         27d
...
tiller-deploy-659c6788f5-pd6qf            1/1     Running   12         27d

Please let us know whether the above two are available. Both components are expected to be installed by executing bootstrap.sh, but somehow they do not seem to be installed on your system.

Either 1) install those components manually, or 2) follow the instructions below:

Could you please check whether there were errors while updating the apt repositories during bootstrap.sh, by executing sudo apt update?

When you execute sudo apt update, if it shows a prompt related to the NVIDIA repository, please confirm it by typing y manually.
(The reason is that the repository address of nvidia-docker was changed recently, so it requires the user's confirmation.)
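
If apt keeps failing on that repository, the entries can also be refreshed manually. This is only a sketch of the steps documented for nvidia-docker on Ubuntu; please follow the official install guide for your distribution:

    distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    sudo apt update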
Then, please re-execute bootstrap.sh to re-install nvidia-docker, and then execute the following commands in the bootstrap folder to re-install the NVIDIA device plugin:

    kubectl create -f nvidia-device-plugin.yml

    kubectl create -f rbac-config.yaml

    sudo systemctl restart kubelet
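
After the kubelet restarts, the device plugin pod should come up and the node should start advertising GPUs again; a quick way to confirm:

    kubectl get pods -n kube-system | grep nvidia-device-plugin

    kubectl describe nodes | grep nvidia.com/gpu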

Actually, the simplest way is to execute sudo ./uninstall-prereqs.sh and then sudo ./bootstrap.sh again to force-reinstall Docker and Kubernetes.
(Caution: depending on your answers to the prompts, your existing Docker images and Kubernetes setup may be reset, so be careful when executing ./uninstall-prereqs.sh.)