NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

Hey, after a reboot I received this message: NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver.
5.15.0-43-generic #46~20.04.1-Ubuntu
x86_64 GNU/Linux

nvidia-bug-report.log (343.1 KB)

You might want to run ‘apt --fix-broken install’ to correct these.
The following packages have unmet dependencies:
nvidia-dkms-510 : Depends: nvidia-kernel-common-510 (>= 510.85.02) but 510.73.08-0ubuntu1 is to be installed
nvidia-docker2 : Depends: nvidia-container-toolkit (>= 1.10.0-1) but 1.9.0-1 is to be installed
nvidia-driver-510 : Depends: nvidia-kernel-common-510 (>= 510.85.02) but 510.73.08-0ubuntu1 is to be installed
E: Unmet dependencies. Try ‘apt --fix-broken install’ with no packages (or specify a solution).

I would like to know if you can see anything in the log, so that I can prevent the same problem in the future.

It’s a package manager issue, none of that will be caught in the nvidia-bug-report.log.
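The unmet dependencies in the apt output quoted above can be summarized with a small script. This is only a sketch: the regular expression is written against the three lines pasted in this thread, and the function names (`parse_unmet_dependencies`, `stale_dependencies`) are made up for illustration. It shows which dependencies apt wanted to install at an older version than the driver packages require.

```python
import re

# Lines copied from the apt output quoted earlier in this thread.
APT_OUTPUT = """\
nvidia-dkms-510 : Depends: nvidia-kernel-common-510 (>= 510.85.02) but 510.73.08-0ubuntu1 is to be installed
nvidia-docker2 : Depends: nvidia-container-toolkit (>= 1.10.0-1) but 1.9.0-1 is to be installed
nvidia-driver-510 : Depends: nvidia-kernel-common-510 (>= 510.85.02) but 510.73.08-0ubuntu1 is to be installed
"""

# Pattern is an assumption based only on the lines above.
UNMET = re.compile(
    r"^\s*(?P<package>\S+) : Depends: (?P<dependency>\S+) "
    r"\((?P<op>[<>=]+) (?P<required>[^)]+)\) "
    r"but (?P<candidate>\S+) is to be installed"
)


def parse_unmet_dependencies(text):
    """Return one dict per 'Depends: ... but ... is to be installed' line."""
    results = []
    for line in text.splitlines():
        match = UNMET.match(line)
        if match:
            results.append(match.groupdict())
    return results


def stale_dependencies(text):
    """Map each lagging dependency to (required constraint, candidate version)."""
    return {
        d["dependency"]: (d["op"] + " " + d["required"], d["candidate"])
        for d in parse_unmet_dependencies(text)
    }


for name, (required, candidate) in stale_dependencies(APT_OUTPUT).items():
    print(f"{name}: need {required}, apt would install {candidate}")
```

For the output in this thread, the three failing packages reduce to two lagging dependencies: nvidia-kernel-common-510 and nvidia-container-toolkit.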

So, it happened after a reboot. I ran apt --fix-broken install and it solved the problem.
So the problem was caused only by the Ubuntu package manager?

I found the package manager’s log file. In the log I saw that every so often it starts unpacking the NVIDIA drivers.
The question is why this happened, and how I can prevent it.
term.log (28.8 KB)

 trying to overwrite '/usr/bin/nvidia-powerd', which is also in package nvidia-compute-utils-510 510.73.08-0ubuntu1

Looks like a packaging bug, two packages contain the same file.
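To look for the same kind of conflict in other logs, something like the following could be used. It is a rough sketch: the pattern is derived only from the single line quoted above, and `find_file_conflicts` is a hypothetical helper name, not part of any tool.

```python
import re

# Line copied from term.log as quoted in this thread.
TERM_LOG = (
    " trying to overwrite '/usr/bin/nvidia-powerd', which is also in "
    "package nvidia-compute-utils-510 510.73.08-0ubuntu1\n"
)

# Pattern is an assumption based on that one line.
OVERWRITE = re.compile(
    r"trying to overwrite '(?P<path>[^']+)', "
    r"which is also in package (?P<owner>\S+) (?P<version>\S+)"
)


def find_file_conflicts(log_text):
    """Return (path, owning package, owning version) for each conflict found."""
    return [
        (m["path"], m["owner"], m["version"])
        for m in OVERWRITE.finditer(log_text)
    ]


for path, owner, version in find_file_conflicts(TERM_LOG):
    print(f"{path} is already owned by {owner} {version}")
```

On a real system the text would come from the apt terminal log (on Ubuntu usually /var/log/apt/term.log) instead of the `TERM_LOG` string.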

Could you please also check the log files from some other computers?
nvidia-bug-report (1).log (252.6 KB)

nvidia-bug-report.log (264.6 KB)

Also, another server shows this error: watchdog: BUG: soft lockup - CPU#10 stuck for 52s! [irq/145-nvidia:802]
nvidia-bug-report.log (2.6 MB)

The first two are missing the kernel modules, check
dkms status
The third is crashing due to gpu errors. Might be due to overheating or the gpu is damaged.
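If the dkms status check has to be repeated on several machines, its output can be parsed with a short script. This is a sketch under assumptions: the sample lines below are hypothetical (they are not taken from the attached logs) and follow the two line layouts that different dkms versions commonly print; `parse_dkms_status` and `has_installed_module` are illustrative names.

```python
# Hypothetical sample lines, NOT taken from the logs in this thread.
SAMPLE_OLD = "nvidia, 510.85.02, 5.15.0-43-generic, x86_64: installed"
SAMPLE_NEW = "nvidia/510.85.02, 5.15.0-43-generic, x86_64: installed"
SAMPLE_NOT_BUILT = "nvidia/510.85.02: added"


def parse_dkms_status(text):
    """Parse 'dkms status' style lines into dicts (layout is an assumption)."""
    entries = []
    for line in text.splitlines():
        if ":" not in line:
            continue
        left, status = line.split(":", 1)
        # Newer dkms prints 'module/version', older prints 'module, version'.
        fields = [f.strip() for f in left.replace("/", ",", 1).split(",")]
        words = status.split()
        entries.append({
            "module": fields[0],
            "version": fields[1] if len(fields) > 1 else None,
            "kernel": fields[2] if len(fields) > 2 else None,
            "status": words[0] if words else None,
        })
    return entries


def has_installed_module(text, module, kernel):
    """True if the module is reported as installed for the given kernel."""
    return any(
        e["module"] == module and e["kernel"] == kernel
        and e["status"] == "installed"
        for e in parse_dkms_status(text)
    )


KERNEL = "5.15.0-43-generic"  # kernel version from the first post
print(has_installed_module(SAMPLE_NEW, "nvidia", KERNEL))
print(has_installed_module(SAMPLE_NOT_BUILT, "nvidia", KERNEL))
```

A result of False for the running kernel would match the "missing kernel modules" case described above. The running kernel version can be read with uname -r.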

So, on the first two computers I installed the NVIDIA drivers with the run file, and that solved the problem.
But this issue happened when I ran a new AI algorithm on the GPU.

Please use cuda gpu memtest to check the video memory

Hello again, the third one has probably crashed: " nvidia-smi
Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error "
nvidia-bug-report.log (1.2 MB)

Any suggestions? The server still doesn’t work.

Of course not. Please check if the gpu works in another system, if not, replace.