fflath
April 28, 2022, 5:46pm
1
I am trying to deploy to a GKE cluster with GPUs attached.
The pod comes up and I can access it, but it can't seem to find the NVIDIA libraries. When I try to import the cudf library I get:
RuntimeError: Function "cuInit" not found
RuntimeError: Function "cuDeviceGetCount" not found
Any advice for deploying/configuring this to run on GKE?
Hi fflath,
Can you tell us a little more about how your deployment is set up? What image are you running? What does your pod spec look like? What command, if any, is set for the container in the pod spec?
fflath
April 28, 2022, 8:58pm
3
I created a new Dockerfile and a deployment spec. I made some progress: now the kernel dies when I try to import cudf.
FROM nvcr.io/ea-reopt-member-zone/ea-cuopt:v0.2
RUN conda install -y jupyter
SHELL ["conda", "run", "-n", "cuopt", "/bin/bash", "-c"]
RUN conda install -y -c anaconda ipykernel && \
python -m ipykernel install --user --name=cuopt
WORKDIR /root
CMD ["jupyter","notebook","--allow-root","--ip=0.0.0.0"]
---
apiVersion: "v1"
kind: "ConfigMap"
metadata:
  name: "cuopt-config-zjjs"
  namespace: "default"
  labels:
    app: "cuopt"
data:
  LD_LIBRARY_PATH: "/usr/local/cuda/lib64:/usr/local/nvidia/lib64:/usr/local/cuda/lib64/stubs"
---
apiVersion: "apps/v1"
kind: Deployment
metadata:
  name: "cuopt"
  labels:
    app: "cuopt"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: "cuopt"
  template:
    metadata:
      labels:
        app: "cuopt"
    spec:
      containers:
        - name: "cuopt-sha256-1"
          image: "us-central1-docker.pkg.dev/..."
          env:
            - name: "LD_LIBRARY_PATH"
              valueFrom:
                configMapKeyRef:
                  key: "LD_LIBRARY_PATH"
                  name: "cuopt-config-zjjs"
          ports:
            - containerPort: 8888
          resources:
            limits:
              nvidia.com/gpu: 1
---
apiVersion: "v1"
kind: "Service"
metadata:
  name: "cuopt-service"
  namespace: "default"
spec:
  ports:
    - protocol: "TCP"
      port: 8888
      targetPort: 8888
  selector:
    app: "cuopt"
  type: "LoadBalancer"
  loadBalancerIP: ""
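A quick way to narrow down "library not found" errors like the ones above is to check what the pod actually sees. The commands below are a sketch, assuming the `app: cuopt` label from the manifests; the library path is the one GKE's driver installer typically mounts and may differ on your nodes.

```shell
# Find the cuopt pod (assumes the app=cuopt label from the manifests above).
POD=$(kubectl get pods -l app=cuopt -o jsonpath='{.items[0].metadata.name}')

# Confirm the NVIDIA driver is visible inside the container.
kubectl exec "$POD" -- nvidia-smi

# Confirm LD_LIBRARY_PATH was injected from the ConfigMap.
kubectl exec "$POD" -- printenv LD_LIBRARY_PATH

# List the driver libraries GKE mounts into the container (path may vary).
kubectl exec "$POD" -- ls /usr/local/nvidia/lib64
```

If `nvidia-smi` is missing or fails, the driver DaemonSet has not run on that node; if it works but cudf still fails, the problem is usually the library path inside the container.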
Hi,
From our recent call, I gather that this issue is now resolved - please share the resolution, if possible.
Thanks,
fflath
May 11, 2022, 12:47pm
5
There were several issues: the wrong GPU type, and GPU quota limits.
I've updated the post with the Dockerfile and the deployment file (which also exposes the Jupyter service).
The nvidia folder in the Dockerfile just contains the notebooks and helper functions from the online course.
FROM nvcr.io/ea-reopt-member-zone/ea-cuopt:v0.2
RUN conda install -y jupyter
SHELL ["conda", "run", "-n", "cuopt", "/bin/bash", "-c"]
RUN conda install -y -c anaconda ipykernel && \
python -m ipykernel install --user --name=cuopt
WORKDIR /root
COPY nvidia nvidia
CMD ["jupyter","notebook","--allow-root","--ip=0.0.0.0"]
---
apiVersion: "v1"
kind: "ConfigMap"
metadata:
  name: "cuopt-config-zjjs"
  namespace: "default"
  labels:
    app: "cuopt"
data:
  LD_LIBRARY_PATH: "/usr/local/cuda/lib64:/usr/local/nvidia/lib64:/usr/local/cuda/lib64/stubs"
---
apiVersion: "apps/v1"
kind: Deployment
metadata:
  name: "cuopt"
  labels:
    app: "cuopt"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: "cuopt"
  template:
    metadata:
      labels:
        app: "cuopt"
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-p100
      containers:
        - name: "cuopt-sha256-1"
          image: "<project artifact repo image>"
          env:
            - name: "LD_LIBRARY_PATH"
              valueFrom:
                configMapKeyRef:
                  key: "LD_LIBRARY_PATH"
                  name: "cuopt-config-zjjs"
          ports:
            - containerPort: 8888
          resources:
            limits:
              nvidia.com/gpu: 1
---
apiVersion: "v1"
kind: "Service"
metadata:
  name: "cuopt-service"
  namespace: "default"
spec:
  ports:
    - protocol: "TCP"
      port: 8888
      targetPort: 8888
  selector:
    app: "cuopt"
  type: "LoadBalancer"
  loadBalancerIP: ""
To get it all up and running, follow the steps below (make substitutions where needed: project-specific networks, regions, etc.). Note this is not the most efficient setup; I'm not really using the main cluster nodes for anything at the moment. One fairly easy update I think I'll make is to include a pod running openrouteservice, so the mapping/visualization in the notebooks would not rely on an external API call.
Create an artifact repository
Build the docker image
Push it to the artifact repo
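The first three steps might look like the sketch below. The repository name, region, and `PROJECT_ID` are placeholders of mine, not values from the original post.

```shell
# Create an Artifact Registry repository for Docker images (placeholder names).
gcloud artifacts repositories create cuopt-repo \
  --repository-format=docker \
  --location=us-central1

# Let the local docker client authenticate to the registry.
gcloud auth configure-docker us-central1-docker.pkg.dev

# Build, tag, and push the image from the Dockerfile above.
docker build -t us-central1-docker.pkg.dev/PROJECT_ID/cuopt-repo/cuopt:v0.2 .
docker push us-central1-docker.pkg.dev/PROJECT_ID/cuopt-repo/cuopt:v0.2
```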
Create a GKE cluster
gcloud container clusters create cuopt \
  --zone us-central1 \
  --network "" \
  --subnetwork "" \
  --machine-type n1-standard-4 \
  --node-locations us-central1-c \
  --scopes=
Create a GPU node pool
gcloud container node-pools create cuopt-gpu \
  --accelerator type=nvidia-tesla-p100,count=1 \
  --zone us-central1 \
  --cluster cuopt \
  --num-nodes 1 \
  --machine-type n1-standard-4 \
  --node-locations us-central1-c \
  --scopes=
Remove the NoSchedule taint from the GPU node
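GKE applies a `nvidia.com/gpu=present:NoSchedule` taint to GPU node pools; one way to remove it is sketched below. The label selector is an assumption based on the accelerator label GKE sets for the node pool created above.

```shell
# The trailing '-' after the taint removes it from the matched nodes.
kubectl taint nodes \
  -l cloud.google.com/gke-accelerator=nvidia-tesla-p100 \
  nvidia.com/gpu=present:NoSchedule-
```

Alternatively, the taint can be left in place if the pod spec carries a matching toleration.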
Install the NVIDIA driver installer DaemonSet on the cluster
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
Deploy container
kubectl apply -f autodeployment.yml
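Once the LoadBalancer Service has been assigned an external IP (this can take a minute or two), the notebook is reachable on port 8888. A small sketch, assuming the `cuopt-service` name from the manifest above:

```shell
# Watch until EXTERNAL-IP changes from <pending> to an address.
kubectl get service cuopt-service --watch

# Or grab the IP directly once the ingress entry is provisioned.
EXTERNAL_IP=$(kubectl get service cuopt-service \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "http://${EXTERNAL_IP}:8888"
```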
Thank you for your response. We really appreciate you sharing the resolution details.
Best,