Please provide the following information when requesting support.
Hardware - SYS-740GP-TNRT
Hardware - RTX A6000 x 4 GPUs
Hardware - Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
Operating System - Ubuntu 20.04
While trying to deploy an NVIDIA Triton service with 4 replicas on this server, with 1 GPU per replica, 3 pods were running but the 4th pod would not spin up, and the following error was displayed:
FailedScheduling pod/model-2-79d7d6786c-bprm8 0/1 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) had taint {protect: no_schedule}, that the pod didn't tolerate, 1 node(s) didn't match Pod's node affinity/selector.
The following is the Deployment used:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: model-infr
spec:
  replicas: 4
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: model-infr
    spec:
      containers:
      - args:
        - pip3 install opencv-python-headless && tritonserver --model-store=s3://model-infr/
        command:
        - /bin/sh
        - -c
        image: nvcr.io/nvidia/tritonserver:22.06-py3
        imagePullPolicy: IfNotPresent
        name: tritonserver
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        - containerPort: 8001
          name: grpc
          protocol: TCP
        - containerPort: 8002
          name: metrics
          protocol: TCP
        resources:
          limits:
            nvidia.com/gpu: "1"
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      dnsPolicy: ClusterFirst
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-RTX-A5000
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir:
          medium: Memory
        name: dshm
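For reference, the Deployment above does not declare any tolerations, even though the scheduling error mentions a taint {protect: no_schedule} that the pod did not tolerate. A sketch of what we assume would be needed under the pod spec (assuming the taint's effect is NoSchedule; the actual key, value, and effect should be confirmed with `kubectl describe node`):

      # Hypothetical toleration for the taint reported in the FailedScheduling
      # event; the effect is assumed to be NoSchedule and must be verified
      # against the node's actual taint.
      tolerations:
      - key: "protect"
        operator: "Equal"
        value: "no_schedule"
        effect: "NoSchedule"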
Looking forward to knowing how we can deploy these pods across all 4 GPUs.