nvidia-smi fails to recognize all GPUs on Ubuntu 20.04

I have an Ubuntu 20.04 server with two RTX 3090s. The driver version is 525.60.11. I have reinstalled the driver with other versions twice, and sometimes that works, but after a day it goes back to recognizing only one GPU. Today I reinstalled with three different driver versions and it still doesn't work.
The output of nvidia-smi is as follows.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 40%   40C    P5   104W / 370W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The output of lspci | grep NVIDIA is as follows.
(base) root@ubuntu:~# lspci |grep NVIDIA
3b:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1)
3b:00.1 Audio device: NVIDIA Corporation Device 1aef (rev a1)
86:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1)
86:00.1 Audio device: NVIDIA Corporation Device 1aef (rev a1)
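One way to make the mismatch explicit is to compare how many NVIDIA GPUs the PCI bus reports with how many the initialized driver exposes. A minimal sketch (the helper names are mine, not standard tools; the nvidia-smi query flags are the standard ones):

```shell
# Hedged sketch: compare the PCI view of the GPUs with the driver's view.

count_pci_gpus() {
  # count NVIDIA VGA controllers in lspci-style output read from stdin
  grep -c 'VGA compatible controller: NVIDIA'
}

count_driver_gpus() {
  # count GPUs the driver actually initialized, one index per line
  nvidia-smi --query-gpu=index --format=csv,noheader | wc -l
}

# Only meaningful on the affected machine:
if command -v lspci >/dev/null && command -v nvidia-smi >/dev/null; then
  pci=$(lspci | count_pci_gpus)
  drv=$(count_driver_gpus)
  echo "PCI bus sees $pci GPU(s), driver sees $drv"
fi
```

On the output above this would report two GPUs on the PCI bus but only one from the driver, which is exactly the symptom.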

The output of ls -l /dev/nvidia* is as follows.
(base) root@ubuntu:~# ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195, 0 Dec 19 15:05 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 Dec 19 15:05 /dev/nvidia1
crw-rw-rw- 1 root root 195, 255 Dec 19 15:05 /dev/nvidiactl

But writing to the two device nodes with echo produces different errors.
(base) root@ubuntu:~# echo 'hello' > /dev/nvidia0
-bash: echo: write error: Invalid argument
(base) root@ubuntu:~# echo 'hello' > /dev/nvidia1
-bash: /dev/nvidia1: Input/output error
(base) root@ubuntu:~# nvidia-smi

What can I do to fix the problem?

Did you install the driver as part of a CUDA toolkit install, or as a standalone driver install?

I just downloaded the .run file from NVIDIA and ran it directly. Do you mean I should use the CUDA toolkit to install the NVIDIA driver?

Try a network install, or one of the other install options.

Try the deb (local) or deb (network) install method if the local runfile option is not successful for your build, and first remove all existing NVIDIA packages with sudo apt-get purge nvidia-*.
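Putting the purge-and-reinstall advice into concrete commands, here is a hedged sketch assuming Ubuntu 20.04 and NVIDIA's deb (network) repository; the keyring URL and the cuda-drivers metapackage are the ones NVIDIA publishes for ubuntu2004, so adjust if your setup differs:

```shell
#!/usr/bin/env bash
# Sketch only: purge existing NVIDIA bits, then reinstall from the deb
# (network) repo. Assumes Ubuntu 20.04 x86_64 and root via sudo.
set -euo pipefail

# 1. Purge all NVIDIA packages; a previous runfile install also leaves
#    its own uninstaller, so run that too if it exists.
sudo apt-get purge -y 'nvidia-*'
[ -x /usr/bin/nvidia-uninstall ] && sudo /usr/bin/nvidia-uninstall

# 2. Add NVIDIA's repository keyring and install the driver metapackage.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-drivers

# 3. Reboot so the new kernel modules load cleanly.
sudo reboot
```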

OK, thanks! BTW, should I reinstall CUDA and cuDNN after I run sudo apt-get purge nvidia-*?

Reinstall with deb (local) or deb (network) after the purge command, but note that if you install CUDA toolkit 12 there is no separate cuDNN release for it yet.

Installing a different driver from a different source won’t help. Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

OK, thanks for your help. I have uploaded the log.gz file to my post. Would you please check it and help me fix the bug?

RmInitAdapter failed! (0x24:0x65:1427)
The GPU might be broken or incorrectly installed. Please reseat it in its PCIe slot, swap slots, and check the power connectors.
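The RmInitAdapter failure quoted above comes from the kernel log, so after each reboot you can check whether the second GPU failed to initialize again by grepping dmesg. A small sketch (the function name is mine; the patterns match the NVRM messages the driver actually emits):

```shell
# Hedged sketch: scan kernel log text for NVIDIA driver init failures
# ("RmInitAdapter failed") and hardware error reports ("NVRM: Xid").

nvrm_errors() {
  # filter stdin down to NVRM init failures and Xid error lines
  grep -E 'RmInitAdapter failed|NVRM: Xid'
}

# Usage on the affected machine (dmesg may need root on recent kernels):
#   sudo dmesg | nvrm_errors
```

If the same RmInitAdapter line reappears on every boot regardless of driver version, that points toward the hardware rather than the software stack.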

OK, thanks! So you mean it is a hardware issue? What is very strange, though, is that I have fixed the problem twice before by reinstalling the GPU driver.

Did you use the same driver?

Is this the one you used?

Linux X64 (AMD64/EM64T) Display Driver

Version: 525.60.11
Release Date: 2022.11.28
Operating System: Linux 64-bit
Language: English (US)
File Size: 394.72 MB

Yes. And I have also tried 520.56.06 and 515.86.01. Sometimes reinstalling the driver and rebooting works, but this time it didn't.