cuOpt-Helm chart deployment issue

gokarna.pavuluri · May 10, 2022, 12:41pm

Hi Team, I am trying to deploy cuOpt in OCI through Helm charts, but I am facing a runtime error as part of the deployment. Please find the attached screenshot for reference. Any inputs on this?

ramakrishnap · May 11, 2022, 2:39pm

Hey, I have a few questions regarding this:

Looking at the logs, it seems the container is not able to access GPU resources. Can you please test whether you can run a sample NVIDIA container and see whether it is able to find the GPUs? You can find an example at the end of the "Install Kubernetes" page in the NVIDIA Cloud Native Technologies documentation.
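
For reference, a GPU smoke test along those lines could look roughly like the sketch below. It only writes a Pod manifest to a file; the pod name gpu-smoke-test, the file name, and the sample image tag are assumptions for illustration (use whichever sample image the linked documentation currently lists). The nvidia.com/gpu limit is the standard resource name exposed by the NVIDIA device plugin.

```shell
#!/usr/bin/env bash
# Sketch only: writes a minimal GPU test Pod manifest to gpu-test-pod.yaml.
# Assumptions: pod name, file name, and image tag are illustrative.
cat > gpu-test-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # hypothetical pod name
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "nvidia/samples:vectoradd-cuda10.2"   # assumed sample image tag
      resources:
        limits:
          nvidia.com/gpu: 1       # request one GPU from the device plugin
EOF
echo "wrote gpu-test-pod.yaml"
```

You would then apply it with kubectl apply -f gpu-test-pod.yaml and inspect kubectl logs gpu-smoke-test. If the pod stays Pending because nvidia.com/gpu cannot be satisfied, that would point to the nodes not advertising any GPUs.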

ramakrishnap · May 11, 2022, 2:52pm

You can also do a dry run of the container, without trying to fetch cuOpt, and check whether the NVIDIA GPUs are available.

Add the command line from the following snippet to ea-cuopt-server/templates/deployment.yaml in the chart you downloaded:

      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.registry }}/{{ .Values.nvcuopt.image }}:{{ .Values.nvcuopt.version }}"
          command: ["python", "-m", "http.server"]

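Uninstall the existing release and install the updated Helm chart.
List the resources in the namespace to find the pod name:
      kubectl -n $NAMESPACE get all
Open a shell in the pod (replace pod_name with the actual pod name):
      kubectl exec -it -n $NAMESPACE pod_name -- /bin/bash
Inside the container, check whether the GPUs are visible:
      nvidia-smi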
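
The sequence above could be scripted roughly as follows. This is a sketch under assumptions, not something taken from the chart: the release name cuopt and the chart path ./ea-cuopt-server are hypothetical, and the script defaults to printing the commands (DRY_RUN=1) rather than running them. It calls nvidia-smi directly through kubectl exec instead of opening an interactive shell.

```shell
#!/usr/bin/env bash
# Sketch only: RELEASE and CHART_DIR are assumed names, not from the thread.
NAMESPACE="${NAMESPACE:-default}"
RELEASE="${RELEASE:-cuopt}"                  # hypothetical Helm release name
CHART_DIR="${CHART_DIR:-./ea-cuopt-server}"  # assumed path to the edited chart
DRY_RUN="${DRY_RUN:-1}"                      # 1 = print commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

gpu_dry_run_check() {
  # Reinstall the chart with the overridden container command.
  run helm uninstall "$RELEASE" -n "$NAMESPACE"
  run helm install "$RELEASE" "$CHART_DIR" -n "$NAMESPACE"
  # Wait for the pods to come up before exec-ing into one.
  run kubectl -n "$NAMESPACE" wait --for=condition=Ready pod --all --timeout=300s

  # In dry-run mode use a placeholder; otherwise take the first pod listed.
  if [ "$DRY_RUN" = "1" ]; then
    pod="pod_name"
  else
    pod="$(kubectl -n "$NAMESPACE" get pods -o name | head -n 1)"
  fi

  # Run nvidia-smi inside the container to see whether GPUs are visible.
  run kubectl exec -n "$NAMESPACE" "$pod" -- nvidia-smi
}

gpu_dry_run_check
```

Run it once as is to review the printed commands, then with DRY_RUN=0 to execute them. If nvidia-smi is missing or lists no devices inside the pod, the container is not getting GPU access.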

user162039 · May 12, 2022, 2:11pm

@gokarna.pavuluri

Hi, have you had a chance to try the above steps to diagnose the problem? It seems the cluster may not be successfully GPU-enabled, as @ramakrishnap suggested. Running nvidia-smi would help determine that.

gokarna.pavuluri · May 12, 2022, 2:48pm

Hi Team, we are working on the solution provided to me. I shall keep you posted on the result.

Thank you.

user162039 · May 17, 2022, 1:48pm

Hello @gokarna.pavuluri,

Have you solved the issue? I can create an OCI cluster to reproduce and troubleshoot in parallel if we are still looking for a solution.