Deploying on GKE

I am trying to deploy to a GKE cluster with GPUs attached.

The pod comes up and I can access it, but it can’t seem to find the NVIDIA libraries. When I try to import the cudf library I get:

RuntimeError: Function "cuInit" not found
RuntimeError: Function "cuDeviceGetCount" not found

Any advice for deploying/configuring to run on GKE?

Hi fflath,

Can you tell us a little more about how your deployment is set up? What image are you running? What does your pod spec look like? What command is set for the container in the pod spec, if any?
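One quick check, if you can exec into the pod (pod name is a placeholder): on GKE the driver and its libraries are normally mounted into GPU containers under /usr/local/nvidia, and that directory has to be on LD_LIBRARY_PATH for cudf to find the driver.

# Are the driver libraries mounted into the container?
kubectl exec <cuopt-pod-name> -- ls /usr/local/nvidia/lib64
# Does the container's LD_LIBRARY_PATH include that directory?
kubectl exec <cuopt-pod-name> -- printenv LD_LIBRARY_PATH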

I created a new Dockerfile and deployment spec. I made some progress? Now the kernel dies when I try to import cudf.

FROM nvcr.io/ea-reopt-member-zone/ea-cuopt:v0.2

# Install Jupyter in the base environment
RUN conda install -y jupyter
# Run the remaining build steps inside the cuopt conda environment
SHELL ["conda", "run", "-n", "cuopt", "/bin/bash", "-c"]
# Register the cuopt environment as a Jupyter kernel
RUN conda install -y -c anaconda ipykernel && \
  python -m ipykernel install --user --name=cuopt
WORKDIR /root
CMD ["jupyter", "notebook", "--allow-root", "--ip=0.0.0.0"]
---
apiVersion: "v1"
kind: "ConfigMap"
metadata:
  name: "cuopt-config-zjjs"
  namespace: "default"
  labels:
    app: "cuopt"
data:
  LD_LIBRARY_PATH: "/usr/local/cuda/lib64:/usr/local/nvidia/lib64:/usr/local/cuda/lib64/stubs"
---
apiVersion: "apps/v1"
kind: Deployment
metadata:
  name: "cuopt"
  labels:
    app: "cuopt"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: "cuopt"
  template:
    metadata:
      labels:
        app: "cuopt"
    spec:
      containers:
      - name: "cuopt-sha256-1"
        image: "us-central1-docker.pkg.dev/..."
        env:
        - name: "LD_LIBRARY_PATH"
          valueFrom:
            configMapKeyRef:
              key: "LD_LIBRARY_PATH"
              name: "cuopt-config-zjjs"
        ports:
        - containerPort: 8888
        resources:
          limits:
            nvidia.com/gpu: 1
---
apiVersion: "v1"
kind: "Service"
metadata:
  name: "cuopt-service"
  namespace: "default"
spec:
  ports:
  - protocol: "TCP"
    port: 8888
    targetPort: 8888
  selector:
    app: "cuopt"
  type: "LoadBalancer"
  loadBalancerIP: ""
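For what it's worth, one way to separate image problems from cluster problems is to run the same image on any machine with a GPU and the NVIDIA container toolkit (the tag here is a placeholder):

# Build the image and run it locally with GPU access, then try importing cudf in a notebook
docker build -t cuopt-test .
docker run --rm --gpus all -p 8888:8888 cuopt-test

If cudf imports cleanly there, the issue is on the cluster side (drivers, device plugin, taints) rather than in the image.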

Hi,

From our recent call, I gather that this issue is now resolved - please share the resolution, if possible.

Thanks,

There were several issues: the wrong GPU type, and GPU quota limits.
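On the quota side, a quick way to check what is available in the target region/zone (region and zone as used below):

# Regional GPU quotas (look for entries like NVIDIA_P100_GPUS)
gcloud compute regions describe us-central1 | grep -B1 -A1 GPUS
# GPU types actually offered in the target zone
gcloud compute accelerator-types list --filter="zone:us-central1-c"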

I’ve updated this post with the Dockerfile and the deployment file (which also exposes the Jupyter service).
The nvidia folder copied in the Dockerfile just contains the notebooks and helper functions from the online course.

FROM nvcr.io/ea-reopt-member-zone/ea-cuopt:v0.2

# Install Jupyter in the base environment
RUN conda install -y jupyter
# Run the remaining build steps inside the cuopt conda environment
SHELL ["conda", "run", "-n", "cuopt", "/bin/bash", "-c"]
# Register the cuopt environment as a Jupyter kernel
RUN conda install -y -c anaconda ipykernel && \
  python -m ipykernel install --user --name=cuopt

WORKDIR /root
# Course notebooks and helper functions
COPY nvidia nvidia
CMD ["jupyter", "notebook", "--allow-root", "--ip=0.0.0.0"]
---
apiVersion: "v1"
kind: "ConfigMap"
metadata:
  name: "cuopt-config-zjjs"
  namespace: "default"
  labels:
    app: "cuopt"
data:
  LD_LIBRARY_PATH: "/usr/local/cuda/lib64:/usr/local/nvidia/lib64:/usr/local/cuda/lib64/stubs"
---
apiVersion: "apps/v1"
kind: Deployment
metadata:
  name: "cuopt"
  labels:
    app: "cuopt"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: "cuopt"
  template:
    metadata:
      labels:
        app: "cuopt"
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-p100
      containers:
      - name: "cuopt-sha256-1"
        image: "<project artifact repo image"
        env:
        - name: "LD_LIBRARY_PATH"
          valueFrom:
            configMapKeyRef:
              key: "LD_LIBRARY_PATH"
              name: "cuopt-config-zjjs"
        ports:
        - containerPort: 8888
        resources:
          limits:
            nvidia.com/gpu: 1
---
apiVersion: "v1"
kind: "Service"
metadata:
  name: "cuopt-service"
  namespace: "default"
spec:
  ports:
  - protocol: "TCP"
    port: 8888
    targetPort: 8888
  selector:
    app: "cuopt"
  type: "LoadBalancer"
  loadBalancerIP: ""

To get it all up and running, make substitutions where needed (project-specific networks, regions, etc.). Also note this is not the most efficient setup; I’m not really using the main cluster nodes for anything at the moment. One fairly easy update I think I’ll make is to include a pod running openrouteservice, so the mapping/visualization in the notebooks would not rely on an external API call.

  1. Create an artifact repository
  2. Build the Docker image
  3. Push it to the artifact repo (see the sketch below for steps 1–3)
  4. Create a GKE cluster
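For steps 1–3, the commands look roughly like this (project, repository, and region names are placeholders); the cluster-create command for step 4 follows below:

# 1. Create a Docker repository in Artifact Registry
gcloud artifacts repositories create <repo> --repository-format=docker --location=us-central1
# Let docker authenticate against the Artifact Registry host
gcloud auth configure-docker us-central1-docker.pkg.dev
# 2. Build the image from the Dockerfile above
docker build -t us-central1-docker.pkg.dev/<project>/<repo>/cuopt:v0.2 .
# 3. Push it to the artifact repo
docker push us-central1-docker.pkg.dev/<project>/<repo>/cuopt:v0.2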

gcloud container clusters create cuopt \
  --zone us-central1 \
  --network "" \
  --subnetwork "" \
  --machine-type n1-standard-4 \
  --node-locations us-central1-c \
  --scopes=

  5. Create a GPU node pool

gcloud container node-pools create cuopt-gpu \
  --accelerator type=nvidia-tesla-p100,count=1 \
  --zone us-central1 \
  --cluster cuopt \
  --num-nodes 1 \
  --machine-type n1-standard-4 \
  --node-locations us-central1-c \
  --scopes=

  6. Remove the NoSchedule taint from the GPU node (see the sketch just below)
  7. Install the NVIDIA driver installer DaemonSet on the cluster
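For step 6, GKE puts a NoSchedule taint on GPU nodes (typically with key nvidia.com/gpu); removing it looks roughly like this, with the node name as a placeholder and the exact taint checked first. The DaemonSet for step 7 is applied just below.

# Check which taints GKE applied to the GPU node
kubectl describe node <gpu-node-name> | grep -A3 Taints
# Remove the NoSchedule taint so the notebook pod can be scheduled there
kubectl taint nodes <gpu-node-name> nvidia.com/gpu:NoSchedule-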

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
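Once the driver installer DaemonSet has finished, it is worth confirming that the node actually advertises the GPU before deploying (node name is a placeholder):

# Driver installer pods should be running in kube-system
kubectl get pods -n kube-system | grep nvidia
# The GPU node should now list nvidia.com/gpu under Capacity/Allocatable
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu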

  8. Deploy the container
    kubectl apply -f autodeployment.yml
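Once the deployment and service are up, the notebook is reachable on the LoadBalancer's external IP on port 8888; the Jupyter token can be pulled from the pod logs (pod name is a placeholder):

# Wait for EXTERNAL-IP to be assigned, then browse to http://<external-ip>:8888
kubectl get service cuopt-service --watch
# The notebook URL/token is printed in the container logs
kubectl logs <cuopt-pod-name>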

Thank you for your response. We really appreciate you sharing the resolution details.

Best,