nvidia-smi fails to recognize all GPUs on Ubuntu 20.04

I have an Ubuntu 20.04 server with two RTX 3090s. The driver version is 525.60.11. I have reinstalled the driver with other versions twice, and sometimes that works, but after a day it goes back to recognizing only one GPU. Today I reinstalled with three different driver versions and it still doesn't work.
The output of nvidia-smi is as follows.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 40%   40C    P5   104W / 370W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The output of lspci | grep NVIDIA is as follows.
(base) root@ubuntu:~# lspci |grep NVIDIA
3b:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1)
3b:00.1 Audio device: NVIDIA Corporation Device 1aef (rev a1)
86:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1)
86:00.1 Audio device: NVIDIA Corporation Device 1aef (rev a1)
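One way to make the mismatch explicit is to compare how many NVIDIA GPUs the PCI bus reports with how many the initialized driver exposes. A minimal sketch (the helper names are mine, not standard tools; the nvidia-smi query flags are the standard ones):

```shell
# Hedged sketch: compare the PCI view of the GPUs with the driver's view.

count_pci_gpus() {
  # count NVIDIA VGA controllers in lspci-style output read from stdin
  grep -c 'VGA compatible controller: NVIDIA'
}

count_driver_gpus() {
  # count GPUs the driver actually initialized, one index per line
  nvidia-smi --query-gpu=index --format=csv,noheader | wc -l
}

# Only meaningful on the affected machine:
if command -v lspci >/dev/null && command -v nvidia-smi >/dev/null; then
  pci=$(lspci | count_pci_gpus)
  drv=$(count_driver_gpus)
  echo "PCI bus sees $pci GPU(s), driver sees $drv"
fi
```

On the output above this would report two GPUs on the PCI bus but only one from the driver, which is exactly the symptom.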

The output of ls -l /dev/nvidia* is as follows.
(base) root@ubuntu:~# ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195, 0 Dec 19 15:05 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 Dec 19 15:05 /dev/nvidia1
crw-rw-rw- 1 root root 195, 255 Dec 19 15:05 /dev/nvidiactl

But writing to the two device nodes with echo produces different errors.
(base) root@ubuntu:~# echo 'hello' > /dev/nvidia0
-bash: echo: write error: Invalid argument
(base) root@ubuntu:~# echo 'hello' > /dev/nvidia1
-bash: /dev/nvidia1: Input/output error
(base) root@ubuntu:~# nvidia-smi

What can I do to fix the problem?

Did you install the driver as part of a CUDA toolkit install, or as a standalone driver install?

I just downloaded the .run file from NVIDIA and ran it directly. Do you mean I should use the CUDA toolkit to install the NVIDIA driver?

Try a network install, or one of the other install options.

Try the deb (local) or deb (network) install method if the local runfile option is not successful for your build, and first remove all existing NVIDIA packages with sudo apt-get purge nvidia-*.
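Putting the purge-and-reinstall advice into concrete commands, here is a hedged sketch assuming Ubuntu 20.04 and NVIDIA's deb (network) repository; the keyring URL and the cuda-drivers metapackage are the ones NVIDIA publishes for ubuntu2004, so adjust if your setup differs:

```shell
#!/usr/bin/env bash
# Sketch only: purge existing NVIDIA bits, then reinstall from the deb
# (network) repo. Assumes Ubuntu 20.04 x86_64 and root via sudo.
set -euo pipefail

# 1. Purge all NVIDIA packages; a previous runfile install also leaves
#    its own uninstaller, so run that too if it exists.
sudo apt-get purge -y 'nvidia-*'
[ -x /usr/bin/nvidia-uninstall ] && sudo /usr/bin/nvidia-uninstall

# 2. Add NVIDIA's repository keyring and install the driver metapackage.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-drivers

# 3. Reboot so the new kernel modules load cleanly.
sudo reboot
```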

OK, thanks! BTW, should I reinstall CUDA and cuDNN after I run sudo apt-get purge nvidia-*?

Reinstall with deb (local) or deb (network) after the purge command, but note that if you install CUDA toolkit 12 there is no separate cuDNN release for it yet.

Installing a different driver from a different source won’t help. Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

OK, thanks for your help. I have uploaded the log.gz file to my post. Would you please check it and help me fix the bug?

RmInitAdapter failed! (0x24:0x65:1427)
The GPU might be broken or incorrectly installed. Please reseat it in its PCIe slot, swap slots, and check the power connectors.
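The RmInitAdapter failure quoted above comes from the kernel log, so after each reboot you can check whether the second GPU failed to initialize again by grepping dmesg. A small sketch (the function name is mine; the patterns match the NVRM messages the driver actually emits):

```shell
# Hedged sketch: scan kernel log text for NVIDIA driver init failures
# ("RmInitAdapter failed") and hardware error reports ("NVRM: Xid").

nvrm_errors() {
  # filter stdin down to NVRM init failures and Xid error lines
  grep -E 'RmInitAdapter failed|NVRM: Xid'
}

# Usage on the affected machine (dmesg may need root on recent kernels):
#   sudo dmesg | nvrm_errors
```

If the same RmInitAdapter line reappears on every boot regardless of driver version, that points toward the hardware rather than the software stack.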

OK, thanks! So you mean it is a hardware issue? What is very strange, though, is that I have fixed the problem twice before by reinstalling the GPU driver.

Did you use the same driver?

Is this the one you used?

Linux X64 (AMD64/EM64T) Display Driver

Version: 525.60.11
Release Date: 2022.11.28
Operating System: Linux 64-bit
Language: English (US)
File Size: 394.72 MB

Yes. And I have also tried 520.56.06 and 515.86.01. Sometimes reinstalling the driver and rebooting works, but this time it didn't.