TAO API service bare metal setup issues

• Hardware: RTX 3080
• TAO Toolkit Version: 4.0.0

I am trying to use AutoML and am following the API service bare-metal setup on an Ubuntu 20.04 notebook.

cmd:
bash setup.sh install

error:
TASK [Installing the GPU Operator on NVIDIA Cloud Native Core 6.1] ******************
fatal: [192.168.2.89]: FAILED! => {"changed": true, "cmd": "helm install --version 1.10.1 --values /tmp/values.yaml --create-namespace --namespace nvidia-gpu-operator --devel nvidia/gpu-operator --set driver.version='510.47.03' --wait --generate-name", "delta": "0:05:04.570808", "end": "2023-02-23 14:54:00.709022", "msg": "non-zero return code", "rc": 1, "start": "2023-02-23 14:48:56.138214", "stderr": "Error: INSTALLATION FAILED: timed out waiting for the condition", "stderr_lines": ["Error: INSTALLATION FAILED: timed out waiting for the condition"], "stdout": "", "stdout_lines": []}

I found this TASK in quickstart_api_bare_metal\cnc\cnc-x86-install.yaml

  - name: Installing the GPU Operator on NVIDIA Cloud Native Core 6.1
    when: "enable_mig == false and enable_vgpu == false and enable_rdma == false and enable_gds == false and enable_secure_boot == false and gpu_operator.rc == 1 and network_operator_valid.rc == 1 and 'running' in k8sup.stdout and cnc_version == 6.1"
    shell: helm install --version 1.10.1 --values /tmp/values.yaml --create-namespace --namespace nvidia-gpu-operator --devel nvidia/gpu-operator --set driver.version='{{ gpu_driver_version }}' --wait --generate-name
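
The "timed out waiting for the condition" message only says that helm's --wait gave up (after roughly five minutes, per the "delta" field) before the GPU Operator resources became ready. It does not say which pod is stuck. A minimal sketch for finding out, assuming kubectl is available on the node after the Kubernetes part of the install has run. gpu_operator_diag is a hypothetical helper name, not part of the TAO quickstart; the default namespace is the one from the helm command above.

```shell
# Sketch only: gpu_operator_diag is a hypothetical helper, not part of setup.sh.
# It lists the GPU Operator pods, recent events, and pod details so the
# pod that never became ready can be identified.
gpu_operator_diag() {
    # Namespace defaults to the one used by the helm install command above.
    local ns="${1:-nvidia-gpu-operator}"

    if ! command -v kubectl >/dev/null 2>&1; then
        echo "kubectl not found" >&2
        return 1
    fi

    kubectl get pods -n "$ns" -o wide
    kubectl get events -n "$ns" --sort-by=.lastTimestamp
    kubectl describe pods -n "$ns"
}

# Usage (run on the cluster node after the failed install):
#   gpu_operator_diag
```

Pods stuck in Init, Pending, or CrashLoopBackOff, together with the events output, are usually the most useful things to attach to the thread.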

Can you share the full log? It is best to share the full command together with the corresponding full log.

Please uninstall the NVIDIA driver before running setup.sh.

Steps:

  1. Check the driver version with $ nvidia-smi
  2. Then, for example, if it is the 510 driver, run the commands below to uninstall it.
    $ sudo apt purge nvidia-driver-510
    $ sudo apt autoremove
    $ sudo apt autoclean
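
The two steps above can be sketched as one script. This is only an illustration and assumes the driver was installed from the Ubuntu nvidia-driver-<major> apt package. A driver installed from a .run file would need a different removal method. driver_package_for and purge_installed_driver are hypothetical helper names.

```shell
# Sketch only: both helpers are hypothetical and not part of setup.sh.
# Assumption: the driver came from the Ubuntu "nvidia-driver-<major>" package.

# Map a full driver version (e.g. 510.47.03) to its apt package name.
driver_package_for() {
    local version="$1"
    # Strip everything from the first dot onward to keep the major version.
    echo "nvidia-driver-${version%%.*}"
}

# Purge the driver version that nvidia-smi reports, if a driver is present.
purge_installed_driver() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; no driver to remove" >&2
        return 0
    fi

    local version
    version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    if [ -z "$version" ]; then
        echo "could not read the driver version from nvidia-smi" >&2
        return 1
    fi

    sudo apt purge "$(driver_package_for "$version")"
    sudo apt autoremove
    sudo apt autoclean
}

# Usage:
#   purge_installed_driver
```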

The NVIDIA driver was already removed when I ran setup.sh.
I tried installing the NVIDIA driver first and then running setup.sh, but setup.sh removes it automatically.
After that, all of my attempts were run without the NVIDIA driver.
Which full log do you mean? All of the messages that show up after I run setup.sh?

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

Please run the commands below before running setup.sh.
$ sudo apt purge nvidia-driver-510
$ sudo apt autoremove
$ sudo apt autoclean

Yes, the full log from when you run setup.sh. You can attach it using the attachment button.
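
One way to capture that log is to pipe the installer through tee. A minimal sketch: run_logged is a hypothetical helper and setup_install.log is an arbitrary file name chosen for illustration.

```shell
# Sketch only: run_logged is a hypothetical helper, not part of setup.sh.
# It runs a command, shows its output on the terminal, and saves a copy
# of the same output to a log file.
run_logged() {
    local logfile="$1"
    shift
    # Merge stderr into stdout so Ansible errors land in the same file.
    "$@" 2>&1 | tee "$logfile"
}

# Usage:
#   run_logged setup_install.log bash setup.sh install
```

The resulting file can then be attached to the thread in full instead of pasting only the failing task.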
