fflath
April 28, 2022, 5:46pm
1
I am trying to deploy to a GKE cluster with GPUs attached.
The pod comes up and I can access it, but it can't seem to find the NVIDIA libraries. When I try to import the cudf library I get:
RuntimeError: Function "cuInit" not found
RuntimeError: Function "cuDeviceGetCount" not found
Any advice for deploying/configuring this to run on GKE?
Hi fflath,
Can you tell us a little more about how your deployment is set up? What image are you running? What does your pod spec look like? What command, if any, is set for the container in the pod spec?
fflath
April 28, 2022, 8:58pm
3
I created a new Dockerfile and a deployment spec. I made some progress: now the kernel dies when I try to import cudf.
FROM nvcr.io/ea-reopt-member-zone/ea-cuopt:v0.2
RUN conda install -y jupyter
SHELL ["conda", "run", "-n", "cuopt", "/bin/bash", "-c"]
RUN conda install -y -c anaconda ipykernel && \
python -m ipykernel install --user --name=cuopt
WORKDIR /root
CMD ["jupyter","notebook","--allow-root","--ip=0.0.0.0"]
---
apiVersion: "v1"
kind: "ConfigMap"
metadata:
  name: "cuopt-config-zjjs"
  namespace: "default"
  labels:
    app: "cuopt"
data:
  LD_LIBRARY_PATH: "/usr/local/cuda/lib64:/usr/local/nvidia/lib64:/usr/local/cuda/lib64/stubs"
---
apiVersion: "apps/v1"
kind: Deployment
metadata:
  name: "cuopt"
  labels:
    app: "cuopt"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: "cuopt"
  template:
    metadata:
      labels:
        app: "cuopt"
    spec:
      containers:
        - name: "cuopt-sha256-1"
          image: "us-central1-docker.pkg.dev/..."
          env:
            - name: "LD_LIBRARY_PATH"
              valueFrom:
                configMapKeyRef:
                  key: "LD_LIBRARY_PATH"
                  name: "cuopt-config-zjjs"
          ports:
            - containerPort: 8888
          resources:
            limits:
              nvidia.com/gpu: 1
---
apiVersion: "v1"
kind: "Service"
metadata:
  name: "cuopt-service"
  namespace: "default"
spec:
  ports:
    - protocol: "TCP"
      port: 8888
      targetPort: 8888
  selector:
    app: "cuopt"
  type: "LoadBalancer"
  loadBalancerIP: ""
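A quick way to narrow down "library not found" errors like the ones above is to check what the pod actually sees. The commands below are a sketch, assuming the `app: cuopt` label from the manifests; the library path is the one GKE's driver installer typically mounts and may differ on your nodes.

```shell
# Find the cuopt pod (assumes the app=cuopt label from the manifests above).
POD=$(kubectl get pods -l app=cuopt -o jsonpath='{.items[0].metadata.name}')

# Confirm the NVIDIA driver is visible inside the container.
kubectl exec "$POD" -- nvidia-smi

# Confirm LD_LIBRARY_PATH was injected from the ConfigMap.
kubectl exec "$POD" -- printenv LD_LIBRARY_PATH

# List the driver libraries GKE mounts into the container (path may vary).
kubectl exec "$POD" -- ls /usr/local/nvidia/lib64
```

If `nvidia-smi` is missing or fails, the driver DaemonSet has not run on that node; if it works but cudf still fails, the problem is usually the library path inside the container.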
Hi,
From our recent call, I gather that this issue is now resolved - please share the resolution, if possible.
Thanks,
fflath
May 11, 2022, 12:47pm
5
There were several issues: the wrong GPU type, and GPU quota limits.
I've updated the post with the Dockerfile and the deployment file (which also exposes the Jupyter service).
The nvidia folder in the Dockerfile just contains the notebooks and helper functions from the online course.
FROM nvcr.io/ea-reopt-member-zone/ea-cuopt:v0.2
RUN conda install -y jupyter
SHELL ["conda", "run", "-n", "cuopt", "/bin/bash", "-c"]
RUN conda install -y -c anaconda ipykernel && \
python -m ipykernel install --user --name=cuopt
WORKDIR /root
COPY nvidia nvidia
CMD ["jupyter","notebook","--allow-root","--ip=0.0.0.0"]
---
apiVersion: "v1"
kind: "ConfigMap"
metadata:
  name: "cuopt-config-zjjs"
  namespace: "default"
  labels:
    app: "cuopt"
data:
  LD_LIBRARY_PATH: "/usr/local/cuda/lib64:/usr/local/nvidia/lib64:/usr/local/cuda/lib64/stubs"
---
apiVersion: "apps/v1"
kind: Deployment
metadata:
  name: "cuopt"
  labels:
    app: "cuopt"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: "cuopt"
  template:
    metadata:
      labels:
        app: "cuopt"
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-p100
      containers:
        - name: "cuopt-sha256-1"
          image: "<project artifact repo image>"
          env:
            - name: "LD_LIBRARY_PATH"
              valueFrom:
                configMapKeyRef:
                  key: "LD_LIBRARY_PATH"
                  name: "cuopt-config-zjjs"
          ports:
            - containerPort: 8888
          resources:
            limits:
              nvidia.com/gpu: 1
---
apiVersion: "v1"
kind: "Service"
metadata:
  name: "cuopt-service"
  namespace: "default"
spec:
  ports:
    - protocol: "TCP"
      port: 8888
      targetPort: 8888
  selector:
    app: "cuopt"
  type: "LoadBalancer"
  loadBalancerIP: ""
To get it all up and running, follow the steps below (make substitutions where needed: project-specific networks, regions, etc.). Note this is not the most efficient setup; I'm not really using the main cluster nodes for anything at the moment. One fairly easy update I think I'll make is to include a pod running openrouteservice, so the mapping/visualization in the notebooks would not rely on an external API call.
Create an artifact repository
Build the docker image
Push it to the artifact repo
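The first three steps might look like the sketch below. The repository name, region, and `PROJECT_ID` are placeholders of mine, not values from the original post.

```shell
# Create an Artifact Registry repository for Docker images (placeholder names).
gcloud artifacts repositories create cuopt-repo \
  --repository-format=docker \
  --location=us-central1

# Let the local docker client authenticate to the registry.
gcloud auth configure-docker us-central1-docker.pkg.dev

# Build, tag, and push the image from the Dockerfile above.
docker build -t us-central1-docker.pkg.dev/PROJECT_ID/cuopt-repo/cuopt:v0.2 .
docker push us-central1-docker.pkg.dev/PROJECT_ID/cuopt-repo/cuopt:v0.2
```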
Create a GKE cluster
gcloud container clusters create cuopt \
  --zone us-central1 \
  --network "" \
  --subnetwork "" \
  --machine-type n1-standard-4 \
  --node-locations us-central1-c \
  --scopes=
Create a GPU node pool
gcloud container node-pools create cuopt-gpu \
  --accelerator type=nvidia-tesla-p100,count=1 \
  --zone us-central1 \
  --cluster cuopt \
  --num-nodes 1 \
  --machine-type n1-standard-4 \
  --node-locations us-central1-c \
  --scopes=
Remove the NoSchedule taint from the GPU node
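GKE applies a `nvidia.com/gpu=present:NoSchedule` taint to GPU node pools; one way to remove it is sketched below. The label selector is an assumption based on the accelerator label GKE sets for the node pool created above.

```shell
# The trailing '-' after the taint removes it from the matched nodes.
kubectl taint nodes \
  -l cloud.google.com/gke-accelerator=nvidia-tesla-p100 \
  nvidia.com/gpu=present:NoSchedule-
```

Alternatively, the taint can be left in place if the pod spec carries a matching toleration.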
Install the NVIDIA driver installer DaemonSet on the cluster
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
Deploy container
kubectl apply -f autodeployment.yml
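Once the LoadBalancer Service has been assigned an external IP (this can take a minute or two), the notebook is reachable on port 8888. A small sketch, assuming the `cuopt-service` name from the manifest above:

```shell
# Watch until EXTERNAL-IP changes from <pending> to an address.
kubectl get service cuopt-service --watch

# Or grab the IP directly once the ingress entry is provisioned.
EXTERNAL_IP=$(kubectl get service cuopt-service \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "http://${EXTERNAL_IP}:8888"
```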
Thank you for your response. We really appreciate you sharing the resolution details.
Best,