AutoML training speed and GPU problem

You can copy the log from the terminal and then upload it as a txt file.

Please uninstall the driver.

sudo apt purge nvidia-driver-525
sudo apt autoremove
sudo apt autoclean
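After purging, you can confirm that no driver packages are left behind before reinstalling. A minimal sketch; the `dpkg -l` lines below are illustrative sample output for a 525-series install, not read from this machine:

```shell
# Illustrative "dpkg -l" output lines (assumed package names, not from this machine).
pkgs='ii  nvidia-driver-525         525.105.17  amd64  NVIDIA driver metapackage
rc  nvidia-kernel-common-525  525.105.17  amd64  NVIDIA kernel module support files
ii  xserver-xorg-core         2:21.1.4    amd64  Xorg X server - core server'

# Anything still in the installed state "ii" that matches nvidia was not fully purged:
printf '%s\n' "$pkgs" | awk '$1 == "ii" && $2 ~ /nvidia/ {print $2}'
# → nvidia-driver-525
```

On the real machine you would pipe actual `dpkg -l` output into the same filter; an empty result means the purge completed.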

Then, run the commands below:
$ bash check-inventory.yml
$ bash install

And share the logs with us.

In the stage of TASK [Waiting for the Cluster to become available], I found that I could not rmmod the nvidia module.

When I ran the command sudo rmmod nvidia, I got this error message:

rmmod: ERROR: Module nvidia is in use

What can I do to deal with this problem?

The picture below shows the error log from the TASK [Waiting for the Cluster to become available] stage.

The picture below shows the last components related to the driver.

It gets stuck here, right?

Exactly, I still get stuck here.

Please try to open a new terminal and run the command below.
$ kubectl delete crd <crd-name>
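Note that kubectl delete crd expects the name of the CRD to delete. You can list the cluster's CRDs first and confirm which ones are GPU-related before deleting anything. A sketch; the sample `kubectl get crd` output and CRD names below are assumptions based on a typical GPU Operator install, not taken from this cluster:

```shell
# Illustrative "kubectl get crd" output (assumed CRD names, not from this cluster).
crds='NAME                                        CREATED AT
clusterpolicies.nvidia.com                  2023-01-01T00:00:00Z
nodefeatures.nfd.k8s-sigs.io                2023-01-01T00:00:00Z
certificates.cert-manager.io                2023-01-01T00:00:00Z'

# Show only NVIDIA-related CRDs so nothing else gets deleted by accident:
printf '%s\n' "$crds" | awk 'NR > 1 && /nvidia/ {print $1}'
# → clusterpolicies.nvidia.com
```

On the real cluster you would pipe actual `kubectl get crd` output through the same filter, then pass the resulting name to `kubectl delete crd`.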

If I did that, it would terminate and remove the GPU driver and the GPU-related network plugin.

Is there any method to deal with it?

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks

This command does not affect the driver.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.