Bare-metal TAO 5.0 install error

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) Classification TF2
• TLT Version (Please run "tlt info --verbose" and share "docker_tag" here) 5.0.0
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I ran bash setup.sh install and am getting the following timeout error for TASK [Installing the GPU Operator on NVIDIA Cloud Native Core 6.1]
fatal: [127.0.1.1]: FAILED! => {"changed": true, "cmd": "helm install --version 23.3.2 --values /tmp/values.yaml --create-namespace --namespace nvidia-gpu-operator --devel nvidia/gpu-operator --set driver.enabled=False --set driver.version='535.54.03' --wait --generate-name", "delta": "0:05:04.165468", "end": "2023-09-18 15:23:33.661303", "msg": "non-zero return code", "rc": 1, "start": "2023-09-18 15:18:29.495835", "stderr": "Error: INSTALLATION FAILED: timed out waiting for the condition", "stderr_lines": ["Error: INSTALLATION FAILED: timed out waiting for the condition"], "stdout": "", "stdout_lines": []}

In the past, I have been able to install and uninstall the toolkit API easily, but when I tried again after about a month, I got this error.

NOTE: I don't want to uninstall the existing NVIDIA driver, as that would be a problem for the other users on my system. To ensure that, when the following task appears, I enter n.
TASK [capture user intent to override driver] **********************************

[capture user intent to override driver]

One or more hosts has NVIDIA driver installed. Do you want to override it (y/n)?: n
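(For reference, a generic way to see what the helm --wait step was still waiting on when it timed out; the pod name below is a placeholder, not output captured from this run:)

$ kubectl get pods -n nvidia-gpu-operator
$ kubectl describe pod <stuck-pod-name> -n nvidia-gpu-operator
$ kubectl logs <stuck-pod-name> -n nvidia-gpu-operator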

How did you set gpu-operator-values.yml?

I suggest setting it as below. The driver version is set to 525.85.12, and install_driver is false.

enable_mig: no
mig_profile: all-disabled
mig_strategy: single
nvidia_driver_version: "525.85.12"
install_driver: false #true

Also enter n when you run into the above-mentioned prompt: One or more hosts has NVIDIA driver installed. Do you want to override it (y/n)?
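After rerunning the install, one way to double-check that these values actually reached the GPU Operator chart is to read them back from the installed release (a sketch; the release name is auto-generated, so list it first):

$ helm list -n nvidia-gpu-operator
$ helm get values <release-name> -n nvidia-gpu-operator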

That's exactly how it was set up before. Then I realized the drivers had been updated from 525.85.12 to 535.54.03, but with both versions I get the same error.

Can you share the full log when you run $ bash setup.sh install and $ nvidia-smi?
Thanks.
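For capturing the complete installer log to a file, something along these lines should work (a sketch, not the exact command used in this thread):

$ bash setup.sh install 2>&1 | tee output.txt
$ nvidia-smi > nvidia-smi.txt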


output.txt (212.6 KB)
output.txt has the output of $ bash setup.sh install

Please set gpu-operator-values.yml as below and rerun $ bash setup.sh install.

enable_mig: no
mig_profile: all-disabled
mig_strategy: single
nvidia_driver_version: "525.85.12"
install_driver: false 

Even though my driver version is 535.54.03?
Everything else is the same.

Yes, please try.

Same error

OK, I will mimic your environment and check further.


Could you upload the latest full log? Thanks.

output.txt (139.9 KB)

I cannot reproduce your issue. I installed the 535 driver and then installed the TAO API.

local-morganh@ipp1-1080:~/getting_started_v5.0.0/setup/quickstart_api_bare_metal$ nvidia-smi
Wed Sep 27 03:15:18 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A40                     Off | 00000000:61:00.0 Off |                    0 |
|  0%   36C    P8              30W / 300W |     18MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A40                     Off | 00000000:DB:00.0 Off |                    0 |
|  0%   37C    P8              29W / 300W |     18MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1283      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      1283      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

Steps:

local-morganh@ipp1-1080:~/getting_started_v5.0.0/setup/quickstart_api_bare_metal$ history
    1  . .bashrc
    2  nvidia-smi
    3  cat /etc/lsb-release
    4  nvidia-smi
    5  sudo apt install nvidia-driver-535
    6  nvidia-smi
    7  sudo reboot
    8  . .bashrc
    9  nvidia-smi
   10  nvidia-smi -L
   11  wget --content-disposition https://ngc.nvidia.com/downloads/ngccli_linux.zip && unzip ngccli_linux.zip && chmod u+x ngc-cli/ngc
   12  ls
   13  find ngc-cli/ -type f -exec md5sum {} + | LC_ALL=C sort | md5sum -c ngc-cli.md5
   14  echo "export PATH=\"\$PATH:$(pwd)/ngc-cli\"" >> ~/.bash_profile && source ~/.bash_profile
   15  ln -s $(pwd)/ngc-cli/ngc ./ngc
   16  ngc config set
   17  wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/tao/tao-getting-started/versions/5.0.0/zip -O getting_started_v5.0.0.zip
   18  unzip -u getting_started_v5.0.0.zip -d ./getting_started_v5.0.0 && rm -rf getting_started_v5.0.0.zip && cd ./getting_started_v5.0.0
   19  cd setup/quickstart_api_bare_metal/
   20  vim gpu-operator-values.yml
   21  cat gpu-operator-values.yml
   22  vim hosts
   23  sudo nano /etc/sudoers
   24  vim tao-toolkit-api-ansible-values.yml
   25  cat gpu-operator-values.yml
$ cat gpu-operator-values.yml
enable_mig: no
mig_profile: all-disabled
mig_strategy: single
nvidia_driver_version: "525.85.12"
install_driver: false #true

   26  bash setup.sh check-inventory
   27  bash setup.sh install
   28  kubectl get pods
   29  history

After installation,

local-morganh@ipp1-1080:~/getting_started_v5.0.0/setup/quickstart_api_bare_metal$ kubectl get pods
NAME                                              READY   STATUS    RESTARTS   AGE
ingress-nginx-controller-55794ccb68-sldjn         1/1     Running   0          2m48s
nfs-subdir-external-provisioner-cb9798d4-tp5hx    1/1     Running   0          2m38s
nvidia-smi-ipp1-1080                              1/1     Running   0          2m32s
tao-toolkit-api-app-pod-55c5d88d86-79vpj          1/1     Running   0          2m29s
tao-toolkit-api-jupyterlab-pod-5db94dd6cc-7w4lg   1/1     Running   0          2m29s
tao-toolkit-api-workflow-pod-55db5b9bf9-747jf     1/1     Running   0          2m29s

Could it be because of this driver version vs. my 535.54.03?

Is there any way to bypass this step? Is it crucial to the functioning of AutoML? I don’t know if I need NVIDIA Cloud Native Core 6.1

Please uninstall and install 535 again. Then reboot and run.
Steps:

Uninstall:
$ sudo apt purge nvidia-driver-535
$ sudo apt autoremove
$ sudo apt autoclean

Install:
$ sudo apt install nvidia-driver-535
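After the reboot, a quick sanity check (a generic sketch) before rerunning the installer is to confirm the reinstalled driver is actually loaded:

$ nvidia-smi
$ cat /proc/driver/nvidia/version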

@Morganh
I am still getting the same error even after reinstalling the driver.

Could you please
$ bash setup.sh uninstall
$ kubectl delete crd clusterpolicies.nvidia.com
$ sudo reboot
$ bash setup.sh check-inventory.yml
$ bash setup.sh install

If I run $ bash setup.sh uninstall, kubectl is also uninstalled, and even if I install it separately and run $ kubectl delete crd clusterpolicies.nvidia.com, I get an error that I cannot connect to the server.
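(As a side note, and only an assumption about ordering rather than something confirmed in this thread: the clusterpolicies.nvidia.com CRD only exists while the cluster is still running, so deleting it can only work before setup.sh uninstall tears Kubernetes down. A quick way to check whether the API server is still reachable:)

$ kubectl cluster-info
$ kubectl get crd clusterpolicies.nvidia.com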

Could you please
$ sudo reboot
$ bash setup.sh check-inventory.yml
$ bash setup.sh install