Scheduling a Triton pod is failing

Please provide the following information when requesting support.

Hardware - SYS-740GP-TNRT
Hardware - RTX A6000 x 4 GPUs
Hardware - Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
Operating System - Ubuntu 20.04

While trying to deploy the NVIDIA Triton service on this server with 4 replicas, each requesting 1 GPU, 3 pods were running but the 4th pod did not start, and the following error was displayed.

FailedScheduling    pod/model-2-79d7d6786c-bprm8    0/1 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) had taint {protect: no_schedule}, that the pod didn't tolerate, 1 node(s) didn't match Pod's node affinity/selector.
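The event above packs three separate reasons into one line, which makes it easy to misread. As a rough sketch only, the message can be split into its individual reasons like this. The helper name `split_reasons` is hypothetical, and the parsing assumes the message layout seen in this one example, which is not a stable format.

```python
import re

# Scheduler message copied from the FailedScheduling event above.
MESSAGE = (
    "0/1 nodes are available: 1 Insufficient nvidia.com/gpu, "
    "1 node(s) had taint {protect: no_schedule}, that the pod didn't tolerate, "
    "1 node(s) didn't match Pod's node affinity/selector."
)


def split_reasons(message):
    """Hypothetical helper: return (available, total, [(node_count, reason), ...]).

    Assumes the layout "<a>/<t> nodes are available: <n> <reason>, <n> <reason>, ..."
    as seen in the event above.
    """
    # Split only on the first ": " so the ": " inside the taint stays intact.
    header, _, body = message.partition(": ")
    match = re.match(r"(\d+)/(\d+) nodes are available", header)
    if match is None:
        raise ValueError("unexpected message layout")
    available, total = int(match.group(1)), int(match.group(2))

    reasons = []
    # A new reason starts after a comma that is followed by a node count,
    # so commas inside a single reason are left alone.
    for part in re.split(r",\s+(?=\d+ )", body.rstrip(".")):
        count, reason = part.split(" ", 1)
        reasons.append((int(count), reason))
    return available, total, reasons


if __name__ == "__main__":
    available, total, reasons = split_reasons(MESSAGE)
    print(f"{available} of {total} nodes available")
    for count, reason in reasons:
        print(f"  {count} node(s): {reason}")
```

Each reason is reported with the number of nodes it applies to, so the three clauses (GPU capacity, taint, node selector) can be checked one at a time.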

The following is the Deployment used.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: model-infr
spec:
  replicas: 4
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: model-infr
    spec:
      containers:
      - args:
        - pip3 install opencv-python-headless && tritonserver --model-store=s3://model-infr/
        command:
        - /bin/sh
        - -c
        image: nvcr.io/nvidia/tritonserver:22.06-py3
        imagePullPolicy: IfNotPresent
        name: tritonserver
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        - containerPort: 8001
          name: grpc
          protocol: TCP
        - containerPort: 8002
          name: metrics
          protocol: TCP
        resources:
          limits:
            nvidia.com/gpu: "1"
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      dnsPolicy: ClusterFirst
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-RTX-A5000
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir:
          medium: Memory
        name: dshm

Looking forward to knowing how we can deploy this pod on all 4 GPUs.

Hi @shan_8992

Thanks for connecting with us.
Apologies, we currently handle only Riva-related queries in this forum.
For queries regarding Triton, we request that you file your request at the GitHub link below.

Thanks
