I have an Ubuntu 20.04 server with two RTX 3090 GPUs. The driver version is 525.60.11. I have twice tried reinstalling the driver with a different version. Sometimes that works, but after about a day the system again recognizes only one GPU. Today I reinstalled with three different driver versions, and it still didn't work.
The output of nvidia-smi is as follows.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 40%   40C    P5   104W / 370W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
The output of lspci |grep NVIDIA is as follows.
(base) root@ubuntu:~# lspci |grep NVIDIA
3b:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1)
3b:00.1 Audio device: NVIDIA Corporation Device 1aef (rev a1)
86:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1)
86:00.1 Audio device: NVIDIA Corporation Device 1aef (rev a1)
The output of ls -l /dev/nvidia* is as follows.
(base) root@ubuntu:~# ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195, 0 Dec 19 15:05 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 Dec 19 15:05 /dev/nvidia1
crw-rw-rw- 1 root root 195, 255 Dec 19 15:05 /dev/nvidiactl
However, writing to the two device nodes with echo produces a different error for each GPU.
(base) root@ubuntu:~# echo 'hello' > /dev/nvidia0
-bash: echo: write error: Invalid argument
(base) root@ubuntu:~# echo 'hello' > /dev/nvidia1
-bash: /dev/nvidia1: Input/output error
(base) root@ubuntu:~# nvidia-smi
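The mismatch above (two GPUs visible to lspci, only one listed by nvidia-smi) can be checked in a single step. The following is a minimal sketch, not part of the original session. It assumes lspci, nvidia-smi and dmesg are on the PATH and that the script runs as root; the helper names count_pci_gpus and count_driver_gpus are hypothetical.

```shell
#!/bin/sh
# Sketch: compare the number of NVIDIA GPUs on the PCI bus with the number
# the driver reports. Helper names are hypothetical, chosen for illustration.

# Count NVIDIA display controllers in lspci-style text read from stdin.
# Only "VGA compatible controller" and "3D controller" lines are counted,
# so the HDMI audio functions (e.g. 3b:00.1) are ignored.
count_pci_gpus() {
    grep -i 'nvidia' | grep -c -E 'VGA compatible controller|3D controller'
}

# Count GPUs in `nvidia-smi -L`-style text read from stdin ("GPU 0: ..." lines).
count_driver_gpus() {
    grep -c -E '^GPU [0-9]+:'
}

# Run the comparison only when both tools are installed.
if command -v lspci >/dev/null 2>&1 && command -v nvidia-smi >/dev/null 2>&1; then
    pci=$(lspci | count_pci_gpus)
    drv=$(nvidia-smi -L 2>/dev/null | count_driver_gpus)
    echo "PCI bus: ${pci} GPU(s), driver: ${drv} GPU(s)"
    if [ "$pci" -ne "$drv" ]; then
        # NVRM and Xid are the prefixes the NVIDIA kernel module uses in the kernel log.
        echo "Mismatch; recent NVIDIA kernel messages:"
        dmesg | grep -E 'NVRM|Xid' | tail -n 20
    fi
fi
```

On a system in the state described here, the two counts would presumably differ, and any NVRM or Xid lines printed from the kernel log would be useful to include when asking for help.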
What can I do to fix the problem?