Insufficient resources, can not launch job

Hello,

I had Clara working, but the Kubernetes API server suddenly stopped working. I tried to fix the issue but couldn’t, so I ended up reinstalling Clara.

After the reinstallation, Kubernetes and Clara are working, but when I send DICOMs to the Clara DICOM Adapter, no pipeline is created.

The Clara DICOM Adapter logs say that the job is not created because of insufficient resources.

2021-06-30 11:02:30.097 +00:00 [INFO] [clara-dicom-adapter-96948fff7-r6rwh] Nvidia.Clara.DicomAdapter.API.JobProcessorBase[31] {JobId="755fa38933dd42a191dbd084270437bd", PayloadId="42dd8ae152064cda8a71adfa6f7e2bbf", JobName="REMOVEDAETITLE-13.05.1973", AE Title="REMOVEDAETITLE"} Upload to payload completed
2021-06-30 11:02:30.300 +00:00 [EROR] [clara-dicom-adapter-96948fff7-r6rwh] Nvidia.Clara.DicomAdapter.Server.Repositories.ClaraJobsApi[31] {JobId="755fa38933dd42a191dbd084270437bd", PayloadId="42dd8ae152064cda8a71adfa6f7e2bbf", JobName="REMOVEDAETITLE-13.05.1973", AE Title="REMOVEDAETITLE"} Exception while starting a new job: System.InvalidOperationException: Server returned error code (-8454)
Insufficient resources, can not launch job.

   at Nvidia.Clara.Platform.BaseClient.CheckResponseHeader(ResponseHeader header)
   at Nvidia.Clara.Platform.JobsClient.StartJob(JobId jobId, IReadOnlyList`1 namedValues, CancellationToken cancellationToken)
   at Nvidia.Clara.DicomAdapter.Server.Repositories.ClaraJobsApi.<>c__DisplayClass5_0.<<Start>b__2>d.MoveNext() in /Clara/src/Services/DicomAdapter/src/Server/Repositories/ClaraJobsApi.cs:line 90
--- End of stack trace from previous location where exception was thrown ---
   at Polly.AsyncPolicy.<>c__DisplayClass40_0.<<ImplementationAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Polly.Retry.AsyncRetryEngine.ImplementationAsync[TResult](Func`3 action, Context context, CancellationToken cancellationToken, ExceptionPredicates shouldRetryExceptionPredicates, ResultPredicates`1 shouldRetryResultPredicates, Func`5 onRetryAsync, Int32 permittedRetryCount, IEnumerable`1 sleepDurationsEnumerable, Func`4 sleepDurationProvider, Boolean continueOnCapturedContext)
2021-06-30 11:02:32.348 +00:00 [EROR] [clara-dicom-adapter-96948fff7-r6rwh] Nvidia.Clara.DicomAdapter.Server.Repositories.ClaraJobsApi[31] {JobId="755fa38933dd42a191dbd084270437bd", PayloadId="42dd8ae152064cda8a71adfa6f7e2bbf", JobName="REMOVEDAETITLE-13.05.1973", AE Title="REMOVEDAETITLE"} Exception while starting a new job: System.InvalidOperationException: Server returned error code (-8454)
Insufficient resources, can not launch job.

The clara-clara-platformapiserver logs show:

[D:INFO] Pipeline Service Deployer [Kubernetes] | Extracted Service definition. Name: "trtis", ContainerImage: "nvcr.io/nvidia/tritonserver".
[M:ERROR] Jobs Service                  | Nvidia.Clara.Platform.Server.Services.InsufficientResourcesException: Unable to start job due to insufficient gpus
                                        |    at Nvidia.Clara.Platform.Server.Services.JobsService.JobResourceRequestBuilder.BuildResourceRequests(Job job)
                                        |    at Nvidia.Clara.Platform.Server.Services.JobsService.Start(JobsStartRequest request, ServerCallContext context) in /Clara/src/Platform/Server/Services/JobsService.cs:line 1306

From this, it looks like the Clara Platform API server is unable to find a GPU.

Logs after restarting the pod:

[N:INFO] Platform Server                | Starting Clara Platform Server.
[D:INFO] Clara Platform Server          | Host:                      0.0.0.0
                                        | Port:                      50,051
                                        | Resolver:                  Clara
                                        | Repositories:              K8s
                                        | Storage:                   Disk (/clara/payloads)
                                        | Execution Selector:        Clara
                                        | Service Deployer:          K8s
                                        | Inference Server Deployer: K8s
                                        | Service Volume:            Disk (/clara/service-volumes)
                                        | Inference Server Volume:   Disk (/clara/triton)
                                        | Available GPUs             -1
                                        | Common Volume:             K8s (clara-platformapiserver-common-volume-claim)
                                        | Trace Listeners:           Console, Clara
[D:INFO] Resource Provider              | Added event listener, listener count: 1.

I found similar issues reported on the forum, but the solutions mentioned there did not help.

Could you please help me fix this issue?

Hello,

Sorry to see that you’re running into an issue with Clara Deploy. I have a few questions before I can make any recommendations.

  • Which version of Clara Deploy is generating this issue?
  • What resources (CPU cores, system memory, GPUs) does the server running Clara Deploy have?
  • Does the pipeline you’re trying to run require a GPU? If so, how many GPUs?

Clara CLI version: 0.7.4-18224.d76e47b7
Clara Platform version: 0.7.4-18224.d76e47b7

This server has only one GPU. If I run nvidia-smi, I can see the GPU.
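
Seeing the GPU in nvidia-smi only confirms that the host driver works. On clusters that use the NVIDIA device plugin, Kubernetes tracks GPUs through the `nvidia.com/gpu` resource advertised on each node, so it may be worth checking that value as well. Below is a minimal Python sketch, not part of Clara's tooling, that reads the JSON produced by `kubectl get nodes -o json`. The function name and the sample data are hypothetical.

```python
import json

# Resource name registered by the NVIDIA device plugin.
GPU_RESOURCE = "nvidia.com/gpu"


def allocatable_gpus(nodes_json):
    """Map node name -> allocatable GPU count.

    `nodes_json` is the text output of `kubectl get nodes -o json`.
    Nodes that do not advertise the GPU resource are reported as 0.
    """
    result = {}
    for node in json.loads(nodes_json).get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes serializes resource quantities as strings.
        result[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return result


# Hypothetical sample shaped like the kubectl output for a one-GPU node.
sample_nodes = json.dumps({
    "items": [
        {
            "metadata": {"name": "node-a"},
            "status": {"allocatable": {"cpu": "8", "nvidia.com/gpu": "1"}},
        }
    ]
})
print(allocatable_gpus(sample_nodes))
```

A count of 0 for the node would suggest that the device plugin is not running or not registered after the reinstall, even though nvidia-smi works on the host.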

Yes, the pipeline I am trying to execute requires a GPU, and only one.

I am trying to start Triton as a service. Below is the relevant part of the pipeline definition:

services:
      - name: trtis
        container:
          image: nvcr.io/nvidia/tritonserver
          tag: 21.03-py3
          command: ["tritonserver", "--model-store=$(NVIDIA_CLARA_SERVICE_DATA_PATH)/models"]
        requests:
          gpu: 1
        connections:
          http:
            - name: NVIDIA_CLARA_TRTISURI
              port: 8000

The Clara Platform API logs show that a GPU is found on the node.

[D:INFO] Analysis Message               | Begin analyzing definition "changed-name".
[D:INFO] Analysis Message               | Analysis completed with 0 error(s).
[D:INFO] Resource Provider              | Starting discovery of all nodes/machines available in the cluster.
[D:INFO] Resource Provider              | 1 active node(s) reported by the cluster.
[D:INFO] Resource Provider              | Machine discovered: { addresses: 2, gpus: 1, name: "changed-name", operating-system: "linux" }
[D:PERF 00.055] Resource Provider       | Completed discovery of all nodes/machines available in the cluster.

But for some reason it reports insufficient GPUs when trying to start Triton.

[M:ERROR] Jobs Service                  | Nvidia.Clara.Platform.Server.Services.InsufficientResourcesException: Unable to start job due to insufficient gpus
                                        |    at Nvidia.Clara.Platform.Server.Services.JobsService.JobResourceRequestBuilder.BuildResourceRequests(Job job)
                                        |    at Nvidia.Clara.Platform.Server.Services.JobsService.Start(JobsStartRequest request, ServerCallContext context) in /Clara/src/Platform/Server/Services/JobsService.cs:line 1306

If I create a pipeline without Triton, it runs without any issues.

Any idea what might be causing this?
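
One possibility worth ruling out (the thread does not state the actual cause) is that the single GPU is already claimed by another pod, for example a Triton pod left over from an earlier job. The Python sketch below sums the `nvidia.com/gpu` claims of unfinished pods per node from the output of `kubectl get pods --all-namespaces -o json`. Whether Clara Platform computes GPU availability this way is an assumption, and the function name, pod names, and sample data are hypothetical.

```python
import json
from collections import defaultdict

GPU_RESOURCE = "nvidia.com/gpu"
# Pods in these phases no longer hold their resources.
FINISHED_PHASES = {"Succeeded", "Failed"}


def gpus_claimed_by_pods(pods_json):
    """Map node name -> GPUs claimed by pods that have not finished.

    `pods_json` is the text output of
    `kubectl get pods --all-namespaces -o json`.
    """
    claimed = defaultdict(int)
    for pod in json.loads(pods_json).get("items", []):
        if pod.get("status", {}).get("phase") in FINISHED_PHASES:
            continue
        node = pod.get("spec", {}).get("nodeName")
        if node is None:
            continue  # pod has not been scheduled onto a node yet
        for container in pod["spec"].get("containers", []):
            resources = container.get("resources", {})
            limits = resources.get("limits", {})
            requests = resources.get("requests", {})
            # GPUs are normally set under limits; fall back to requests.
            claimed[node] += int(limits.get(GPU_RESOURCE, requests.get(GPU_RESOURCE, 0)))
    return dict(claimed)


# Hypothetical sample: one running pod holding the GPU, one finished pod.
sample_pods = json.dumps({
    "items": [
        {
            "metadata": {"name": "old-triton"},
            "spec": {
                "nodeName": "node-a",
                "containers": [{"resources": {"limits": {"nvidia.com/gpu": "1"}}}],
            },
            "status": {"phase": "Running"},
        },
        {
            "metadata": {"name": "finished-job"},
            "spec": {
                "nodeName": "node-a",
                "containers": [{"resources": {"limits": {"nvidia.com/gpu": "1"}}}],
            },
            "status": {"phase": "Succeeded"},
        },
    ]
})
print(gpus_claimed_by_pods(sample_pods))
```

If the claimed count on the node already equals its allocatable GPU count, a new service that requests `gpu: 1` would have nothing left to use.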

I have managed to fix this issue.

Thanks for working through this issue. Would you mind spelling out how you overcame it?