AutoML training speed and GPU problem

You can copy the log from the terminal and then upload it as a txt file.

Please uninstall the driver.

sudo apt purge nvidia-driver-525
sudo apt autoremove
sudo apt autoclean
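After purging, you can confirm that no driver packages are left behind before reinstalling. A minimal sketch; the `dpkg -l` lines below are illustrative sample output for a 525-series install, not read from this machine:

```shell
# Illustrative "dpkg -l" output lines (assumed package names, not from this machine).
pkgs='ii  nvidia-driver-525         525.105.17  amd64  NVIDIA driver metapackage
rc  nvidia-kernel-common-525  525.105.17  amd64  NVIDIA kernel module support files
ii  xserver-xorg-core         2:21.1.4    amd64  Xorg X server - core server'

# Anything still in the installed state "ii" that matches nvidia was not fully purged:
printf '%s\n' "$pkgs" | awk '$1 == "ii" && $2 ~ /nvidia/ {print $2}'
# → nvidia-driver-525
```

On the real machine you would pipe actual `dpkg -l` output into the same filter; an empty result means the purge completed.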

Then, run the commands below:
$ bash check-inventory.yml
$ bash install

And share the logs with us.

In the stage of TASK [Waiting for the Cluster to become available], I found that I could not rmmod the nvidia module.

When I ran the command sudo rmmod nvidia, I got this error message:

rmmod: ERROR: Module nvidia is in use

What can I do to deal with this problem?

The picture below shows the error log from the TASK [Waiting for the Cluster to become available] stage.

The picture below shows the last components related to the driver.

It gets stuck here, right?

Exactly, I still get stuck here.

Please try to open a new terminal and run the command below.
$ kubectl delete crd <crd-name>
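Note that kubectl delete crd expects the name of the CRD to delete. You can list the cluster's CRDs first and confirm which ones are GPU-related before deleting anything. A sketch; the sample `kubectl get crd` output and CRD names below are assumptions based on a typical GPU Operator install, not taken from this cluster:

```shell
# Illustrative "kubectl get crd" output (assumed CRD names, not from this cluster).
crds='NAME                                        CREATED AT
clusterpolicies.nvidia.com                  2023-01-01T00:00:00Z
nodefeatures.nfd.k8s-sigs.io                2023-01-01T00:00:00Z
certificates.cert-manager.io                2023-01-01T00:00:00Z'

# Show only NVIDIA-related CRDs so nothing else gets deleted by accident:
printf '%s\n' "$crds" | awk 'NR > 1 && /nvidia/ {print $1}'
# → clusterpolicies.nvidia.com
```

On the real cluster you would pipe actual `kubectl get crd` output through the same filter, then pass the resulting name to `kubectl delete crd`.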

If I did that, it would terminate and remove the GPU driver and the GPU-related network plugin.

Is there any method to deal with it?

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks

This command does not affect the driver.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.