nvidia-smi suddenly cannot see my GeForce GTX 1070 anymore

Yesterday, nvidia-smi still reported the card:

Wed Jul 10 11:08:43 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.93       Driver Version: 410.93       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   59C    P2   141W / 230W |   7179MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

But today it reports nothing.

I reinstalled with NVIDIA-Linux-x86_64-410.93.run, but nvidia-drm was not loaded.

My system is Ubuntu 18.04 Server.

So I ran NVIDIA-Linux-x86_64-410.93.run --uninstall, then add-apt-repository ppa:graphics-drivers/ppa and apt-get install nvidia-driver-430.
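Spelled out, the commands were roughly the following (a sketch from memory; the `sudo` and the `apt-get update` step are assumed):

```shell
# Remove the driver that was installed with the .run installer
sudo ./NVIDIA-Linux-x86_64-410.93.run --uninstall

# Switch to the packaged driver from the graphics-drivers PPA
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-driver-430
```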

After a reboot I got:

$ nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

$ lspci | grep 'VGA'
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 630 (rev 04)

I will attach nvidia-bug-report.log.gz later.

I have two questions:

  1. Which installation method is recommended: the NVIDIA-Linux-x86_64-***.run installer, or apt with ppa:graphics-drivers/ppa?
  2. How can I fix this situation?

nvidia-bug-report.log.gz (68 KB)

This looks like a complete hardware failure. Looking at the logs, the driver was at first unable to initialize the GPU, and now it isn’t even recognized by the mainboard anymore. You can try reseating it or testing it in another system to rule out a mainboard/slot failure, but I rather suspect the GPU is broken.
The recommended install method is from the repo/PPA, not the .run installer.
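If you want to check whether the board still sees the card at all before sending it in, something like this can help (a sketch; the 01:00.0 bus address is taken from your earlier nvidia-smi output):

```shell
# Is any NVIDIA device visible on the PCI bus?
lspci -nn | grep -i nvidia

# If it shows up, dump its detailed state (link width/speed, errors)
sudo lspci -vv -s 01:00.0

# Look for driver or PCIe errors in the kernel log
dmesg | grep -iE 'NVRM|nvidia|pcie'
```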

Thank you for the quick reply.

I reseated the card in the motherboard and it worked for a moment.
The card is under warranty. I sent it to the product service center and they said it passed their tests.

The card then worked for two weeks, but started failing intermittently today.

How should I describe the situation to the product service center so they can repair this card?

$ nvidia-smi
No devices were found
$ nvidia-smi
Wed Aug 21 12:05:28 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   46C    P5    21W / 230W |      0MiB /  8119MiB |      2%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

$ nvidia-smi
No devices were found
$ nvidia-smi
Wed Aug 21 12:05:59 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   46C    P5    28W / 230W |      0MiB /  8119MiB |      2%   E. Process |
+-------------------------------+----------------------+----------------------+

nvidia-bug-report.log.gz (812 KB)

Looks like you’re running headless, compute only? Currently, nvidia-persistenced is constantly starting and stopping, so the GPU is constantly being initialized and deinitialized. That doesn’t work, and it leads to the GPU becoming unresponsive at times. Set nvidia-persistenced to start on boot and run continuously. If the NVIDIA GPU still gets lost under that condition, check it in another system to rule out a failure of the mainboard/PCI slot/PSU.
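On Ubuntu 18.04 that means something like the following (a sketch; note that if the shipped unit is `static`, i.e. has no `[Install]` section, `systemctl enable` will refuse until one is added, e.g. via `systemctl edit --full nvidia-persistenced`):

```shell
# Start the persistence daemon now
sudo systemctl start nvidia-persistenced

# Check that it is running and stays running
systemctl status nvidia-persistenced

# Make it start on every boot (needs an [Install] section in the unit;
# if enable refuses, add "WantedBy=multi-user.target" first)
sudo systemctl enable nvidia-persistenced
```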

Thank you! I only run NN compute workloads on this card, and I have not shut down or rebooted it this month.
I will try running nvidia-persistenced whenever the card is not found.
Debugging hardware will be a challenge.

Thanks again :)

The OS of our server is Ubuntu 18.04.

I reran nvidia-persistenced and it worked, until I rebooted the server yesterday.

/var/log/syslog repeats this:
Sep 3 11:49:31 fafoy systemd[1]: Starting NVIDIA Persistence Daemon…
Sep 3 11:49:31 fafoy nvidia-persistenced: Verbose syslog connection opened
Sep 3 11:49:31 fafoy nvidia-persistenced: Now running with user ID 112 and group ID 113
Sep 3 11:49:31 fafoy nvidia-persistenced: Started (1728)
Sep 3 11:49:31 fafoy nvidia-persistenced: Received signal 15
Sep 3 11:49:31 fafoy nvidia-persistenced: Shutdown (1728)
Sep 3 11:49:31 fafoy systemd[1]: Stopped NVIDIA Persistence Daemon.
Sep 3 11:49:31 fafoy systemd[1]: nvidia-persistenced.service: Found left-over process 1728 (nvidia-persiste) in control group while starting unit. Ignoring.
Sep 3 11:49:31 fafoy systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Sep 3 11:49:31 fafoy systemd[1]: Starting NVIDIA Persistence Daemon…
Sep 3 11:49:31 fafoy nvidia-persistenced: Verbose syslog connection opened
Sep 3 11:49:31 fafoy nvidia-persistenced: Now running with user ID 112 and group ID 113
Sep 3 11:49:31 fafoy nvidia-persistenced: Started (1739)
Sep 3 11:49:31 fafoy nvidia-persistenced: Received signal 15
Sep 3 11:49:31 fafoy nvidia-persistenced: PID file unlocked.
Sep 3 11:49:31 fafoy nvidia-persistenced: PID file closed.
Sep 3 11:49:31 fafoy nvidia-persistenced: The daemon no longer has permission to remove its runtime data directory /var/run/nvidia-persistenced
Sep 3 11:49:31 fafoy nvidia-persistenced: Shutdown (1728)
Sep 3 11:49:32 fafoy nvidia-persistenced: device 0000:01:00.0 - registered
Sep 3 11:49:32 fafoy nvidia-persistenced: Local RPC services initialized
Sep 3 11:49:32 fafoy systemd[1]: Started NVIDIA Persistence Daemon.

So I traced nvidia-persistenced:

fafoy@fafoy:~$ sudo service nvidia-persistenced status
● nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; static; vend
Active: active (running) since Tue 2019-09-03 11:58:23 CST; 35s ago
Process: 11160 ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced (code=exi
Process: 11164 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenc
Main PID: 11165 (nvidia-persiste)
Tasks: 1 (limit: 4915)
CGroup: /system.slice/nvidia-persistenced.service
└─11165 /usr/bin/nvidia-persistenced --user nvidia-persistenced --no-

Sep 03 11:58:23 fafoy systemd[1]: Starting NVIDIA Persistence Daemon…
Sep 03 11:58:23 fafoy nvidia-persistenced[11165]: Verbose syslog connection open
Sep 03 11:58:23 fafoy nvidia-persistenced[11165]: Now running with user ID 112 a
Sep 03 11:58:23 fafoy nvidia-persistenced[11165]: Started (11165)
Sep 03 11:58:23 fafoy nvidia-persistenced[11165]: device 0000:01:00.0 - register
Sep 03 11:58:23 fafoy nvidia-persistenced[11165]: Local RPC services initialized
Sep 03 11:58:23 fafoy systemd[1]: Started NVIDIA Persistence Daemon.

Then I stopped it:

fafoy@fafoy:~$ sudo service nvidia-persistenced stop

fafoy@fafoy:~$ sudo service nvidia-persistenced status
● nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; static; vend
Active: inactive (dead)

Sep 03 11:58:23 fafoy nvidia-persistenced[11165]: Local RPC services initialized
Sep 03 11:58:23 fafoy systemd[1]: Started NVIDIA Persistence Daemon.
Sep 03 11:59:03 fafoy nvidia-persistenced[11165]: Received signal 15
Sep 03 11:59:03 fafoy systemd[1]: Stopping NVIDIA Persistence Daemon…
Sep 03 11:59:03 fafoy nvidia-persistenced[11165]: Socket closed.
Sep 03 11:59:03 fafoy nvidia-persistenced[11165]: PID file unlocked.
Sep 03 11:59:03 fafoy nvidia-persistenced[11165]: PID file closed.
Sep 03 11:59:03 fafoy nvidia-persistenced[11165]: The daemon no longer has permi
Sep 03 11:59:03 fafoy nvidia-persistenced[11165]: Shutdown (11165)
Sep 03 11:59:03 fafoy systemd[1]: Stopped NVIDIA Persistence Daemon.

The config:

fafoy@fafoy:~$ cat /lib/systemd/system/nvidia-persistenced.service
[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target

[Service]
Type=forking
ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

I have some questions:

  1. Is --persistence-mode more suitable than --no-persistence-mode for us?
  2. Which is better: modifying the config, or running nvidia-smi -pm 1?

Thanks!

Persistence mode is deprecated; it is only needed on POWER9 systems and some older Teslas, so with your 1070 you should use --no-persistence-mode. People have had problems with the NVIDIA-provided systemd unit before, see this:
https://devtalk.nvidia.com/default/topic/1052054/linux/ffmpeg-cannot-init-cuda-for-transcoding/post/5340967/#5340967
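One common workaround is to extend the shipped unit so it can be enabled at boot, e.g. via `sudo systemctl edit --full nvidia-persistenced` (a sketch: this is your existing unit from above plus an `[Install]` section, not an official NVIDIA file):

```ini
[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target

[Service]
Type=forking
ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

[Install]
WantedBy=multi-user.target
```

After saving, `sudo systemctl daemon-reload && sudo systemctl enable --now nvidia-persistenced` should make it start on every boot.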

Thanks!