Issue installing DeepOps 21.09 on DGX A100

We received the following error while installing DeepOps with these settings:

cuda_dgx_a100_version: "{{ cuda_version }}" # uncommented
deepops_gpu_operator_enabled: false
k8s_gpu_mig_strategy: "none"

LOG:
TASK [nvidia-k8s-gpu-feature-discovery : install nvidia k8s gpu feature discovery] **************************************************************************
fatal: [dgxa100]: FAILED! => changed=true
  cmd:
  - /usr/local/bin/helm
  - upgrade
  - --install
  - gpu-feature-discovery
  - nvgfd/gpu-feature-discovery
  - --version
  - 0.4.1
  - --set
  - migStrategy=none
  - --wait
  delta: '0:05:00.604667'
  end: '2022-01-15 12:23:21.541372'
  msg: non-zero return code
  rc: 1
  start: '2022-01-15 12:18:20.936705'
  stderr: 'Error: UPGRADE FAILED: timed out waiting for the condition'
  stderr_lines:
  stdout: ''
  stdout_lines:

Any suggestion on fixing this issue?

Hi @joseph.pang!

It looks like you are timing out when trying to deploy the Helm chart for gpu-feature-discovery. Have you tried running the command locally on the DGX A100? For example:

$ helm repo add nvgfd https://nvidia.github.io/gpu-feature-discovery
$ helm repo update
$ helm repo list
(verify that nvgfd repo is there)
$ helm upgrade –install gpu-feature-discovery nvgfd/gpu-feature-discovery –version 0.4.1 –set migStrategy=none –wait

You might have a connection issue when accessing the repo, in which case it would be useful to diagnose whether this is a general connection issue or specifically an issue with access to the repo.
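
A rough way to tell the two cases apart is sketched below. It relies on the fact that a Helm chart repository serves an index.yaml file at its root. The function names check_url and diagnose_repo are made up for this example, and https://github.com is only an assumed stand-in for "some other site that should be reachable"; substitute whatever fits your network.

```shell
# check_url URL: succeed (exit 0) if URL can be fetched; the body is discarded.
check_url() {
  curl --fail --silent --show-error --location --max-time 15 \
       --output /dev/null "$1"
}

# diagnose_repo REPO_URL [GENERAL_URL]
# Try the repo's index.yaml first; if that fails, try an unrelated URL to see
# whether the machine has any outbound connectivity at all.
# GENERAL_URL defaults to https://github.com (an assumption, not a requirement).
diagnose_repo() {
  dr_repo_url=$1
  dr_general_url=${2:-https://github.com}
  if check_url "${dr_repo_url%/}/index.yaml"; then
    echo "repo reachable"
  elif check_url "$dr_general_url"; then
    echo "repo unreachable, general connectivity OK"
  else
    echo "no general connectivity"
  fi
}

# Example (run on the DGX A100):
#   diagnose_repo https://nvidia.github.io/gpu-feature-discovery
```

If this reports "repo reachable", the timeout is probably not a download problem, and the next place to look would be the state of the release and its pods.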

I got the following error after adding the nvgfd repo.

Error: UPGRADE FAILED: "gpu-feature-discovery" has no deployed releases

It looks like you might be getting this error as a result of a syntax issue. Can you try running with the following instead? Note the double dashes.

helm upgrade --install gpu-feature-discovery nvgfd/gpu-feature-discovery --version 0.4.1 --set migStrategy=none

I tested both cases. Using a single dash results in a syntax error.

Then you are likely in a bad state with this Helm release. Try uninstalling and then running the command again:

$ helm uninstall gpu-feature-discovery
release "gpu-feature-discovery" uninstalled

$ helm upgrade --install gpu-feature-discovery nvgfd/gpu-feature-discovery --version 0.4.1 --set migStrategy=none
Release "gpu-feature-discovery" does not exist. Installing it now.
NAME: gpu-feature-discovery
LAST DEPLOYED: Thu Jan 20 22:40:23 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

$ helm list
NAME                 	NAMESPACE	REVISION	UPDATED                             	STATUS  	CHART                      	APP VERSION
gpu-feature-discovery	default  	2       	2022-01-20 22:41:55.591266 -0600 CST	deployed	gpu-feature-discovery-0.4.1	0.4.1
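
To check what state a release is in before and after uninstalling, something like the following may help. It is only a sketch: release_status is a hypothetical helper, and it assumes the default tab-separated table that helm list prints (as in the output above), with STATUS in the fifth column.

```shell
# release_status NAME: read `helm list` output on stdin and print the STATUS
# column for the release called NAME. Prints nothing if NAME is not listed.
release_status() {
  awk -F'\t' -v name="$1" '
    {
      # helm pads columns with spaces; strip the padding before comparing
      for (i = 1; i <= NF; i++) gsub(/^ +| +$/, "", $i)
      if ($1 == name) print $5
    }'
}

# Example (needs helm and access to the cluster). --all also lists releases
# that are not in the deployed state, such as failed or pending ones:
#   helm list --all | release_status gpu-feature-discovery
```

A status other than deployed would be consistent with the release being stuck; helm status gpu-feature-discovery and helm history gpu-feature-discovery give more detail on the same release.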

Recently, we uninstalled DeepOps 21.09 and reconfigured the cluster as follows:

  1. DeepOps 22.01 on a bare-metal machine (Ubuntu 18.04) as the master.
  2. DGX A100 as a node.

However, gpu-operator didn’t start correctly. Could you suggest how to resolve this problem?

gpu-operator-resources   gpu-feature-discovery-tdvm8                0/1   Init:0/1                0    3m1s
gpu-operator-resources   nvidia-container-toolkit-daemonset-88md5   0/1   Init:0/1                0    3m1s
gpu-operator-resources   nvidia-dcgm-exporter-rtrq5                 0/1   Init:0/1                0    3m1s
gpu-operator-resources   nvidia-dcgm-q6qgn                          0/1   Init:0/1                0    3m1s
gpu-operator-resources   nvidia-device-plugin-daemonset-zvhw9       0/1   Init:0/1                0    3m1s
gpu-operator-resources   nvidia-driver-daemonset-5x287              0/1   Init:CrashLoopBackOff   14   60m
gpu-operator-resources   nvidia-operator-validator-dtxvp            0/1   Init:0/4                0    3m1s
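
One way to narrow this down is to list only the pods that are not ready and then inspect the one that is crash-looping. The snippet below is a sketch rather than a fix. not_ready_pods is a hypothetical helper, and it assumes the usual column order of kubectl get pods -A output (NAMESPACE, NAME, READY, STATUS, ...), which matches the listing above.

```shell
# not_ready_pods: read `kubectl get pods -A` output on stdin and print
# "namespace name status" for every pod whose READY count is short (e.g. 0/1).
# Header lines and pods in the Completed state are skipped.
not_ready_pods() {
  awk '
    $3 ~ /^[0-9]+\/[0-9]+$/ && $4 != "Completed" {
      split($3, ready, "/")
      if (ready[1] != ready[2]) print $1, $2, $4
    }'
}

# Example (needs kubectl and access to the cluster):
#   kubectl get pods -A | not_ready_pods
#
# Then look at the events and container logs of the crash-looping pod:
#   kubectl -n gpu-operator-resources describe pod nvidia-driver-daemonset-5x287
#   kubectl -n gpu-operator-resources logs nvidia-driver-daemonset-5x287 --all-containers
```

In the listing above, the driver daemonset is the only pod in CrashLoopBackOff, so its describe output and logs are probably the most useful thing to post here.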