nvidia-smi suddenly cannot see my GeForce GTX 1070 anymore

Yesterday, nvidia-smi still reported the card:

Wed Jul 10 11:08:43 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.93       Driver Version: 410.93       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   59C    P2   141W / 230W |   7179MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

But today it reports nothing.

I reinstalled with NVIDIA-Linux-x86_64-410.93.run, but nvidia-drm was not loaded.

My system is Ubuntu 18.04 Server.

So I ran NVIDIA-Linux-x86_64-410.93.run --uninstall, then add-apt-repository ppa:graphics-drivers/ppa and apt-get install nvidia-driver-430.
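Spelled out, the commands were roughly the following (a sketch from memory; the `sudo` and the `apt-get update` step are assumed):

```shell
# Remove the driver that was installed with the .run installer
sudo ./NVIDIA-Linux-x86_64-410.93.run --uninstall

# Switch to the packaged driver from the graphics-drivers PPA
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-driver-430
```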

After a reboot I got:

$ nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

$ lspci | grep 'VGA'
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 630 (rev 04)

I will attach nvidia-bug-report.log.gz later.

I have two questions:

  1. Which installation method is recommended: the NVIDIA-Linux-x86_64-***.run installer, or apt with ppa:graphics-drivers/ppa?
  2. How can I fix this situation?

nvidia-bug-report.log.gz (68 KB)

This looks like a complete hardware failure. Looking at the logs, the driver was at first unable to initialize the GPU, and now it isn’t even recognized by the mainboard anymore. You can try reseating it or testing it in another system to rule out a mainboard/slot failure, but I rather suspect the GPU is broken.
The recommended install method is from the repo/PPA, not the .run installer.
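If you want to check whether the board still sees the card at all before sending it in, something like this can help (a sketch; the 01:00.0 bus address is taken from your earlier nvidia-smi output):

```shell
# Is any NVIDIA device visible on the PCI bus?
lspci -nn | grep -i nvidia

# If it shows up, dump its detailed state (link width/speed, errors)
sudo lspci -vv -s 01:00.0

# Look for driver or PCIe errors in the kernel log
dmesg | grep -iE 'NVRM|nvidia|pcie'
```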

Thank you for the quick reply.

I reseated the card in the motherboard and it worked for a moment.
The card is under warranty. I sent it to the product service center and they said it passed their tests.

The card then worked for two weeks, but started failing intermittently today.

How should I describe the situation to the product service center so they can repair this card?

$ nvidia-smi
No devices were found
$ nvidia-smi
Wed Aug 21 12:05:28 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   46C    P5    21W / 230W |      0MiB /  8119MiB |      2%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

$ nvidia-smi
No devices were found
$ nvidia-smi
Wed Aug 21 12:05:59 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   46C    P5    28W / 230W |      0MiB /  8119MiB |      2%   E. Process |
+-------------------------------+----------------------+----------------------+

nvidia-bug-report.log.gz (812 KB)

Looks like you’re running headless, compute only? Currently, nvidia-persistenced is constantly starting and stopping, so the GPU is constantly being initialized and deinitialized. That doesn’t work, and it leads to the GPU becoming unresponsive at times. Set nvidia-persistenced to start on boot and run continuously. If the NVIDIA GPU still gets lost under that condition, check it in another system to rule out a failure of the mainboard/PCI slot/PSU.
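On Ubuntu 18.04 that means something like the following (a sketch; note that if the shipped unit is `static`, i.e. has no `[Install]` section, `systemctl enable` will refuse until one is added, e.g. via `systemctl edit --full nvidia-persistenced`):

```shell
# Start the persistence daemon now
sudo systemctl start nvidia-persistenced

# Check that it is running and stays running
systemctl status nvidia-persistenced

# Make it start on every boot (needs an [Install] section in the unit;
# if enable refuses, add "WantedBy=multi-user.target" first)
sudo systemctl enable nvidia-persistenced
```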

Thank you! I only run NN compute workloads on this card, and I have not shut down or rebooted it this month.
I will try running nvidia-persistenced whenever the card is not found.
Debugging hardware will be a challenge.

Thanks again :)

The OS of our server is Ubuntu 18.04.

I reran nvidia-persistenced and it worked, until I rebooted the server yesterday.

/var/log/syslog repeats this:
Sep 3 11:49:31 fafoy systemd[1]: Starting NVIDIA Persistence Daemon…
Sep 3 11:49:31 fafoy nvidia-persistenced: Verbose syslog connection opened
Sep 3 11:49:31 fafoy nvidia-persistenced: Now running with user ID 112 and group ID 113
Sep 3 11:49:31 fafoy nvidia-persistenced: Started (1728)
Sep 3 11:49:31 fafoy nvidia-persistenced: Received signal 15
Sep 3 11:49:31 fafoy nvidia-persistenced: Shutdown (1728)
Sep 3 11:49:31 fafoy systemd[1]: Stopped NVIDIA Persistence Daemon.
Sep 3 11:49:31 fafoy systemd[1]: nvidia-persistenced.service: Found left-over process 1728 (nvidia-persiste) in control group while starting unit. Ignoring.
Sep 3 11:49:31 fafoy systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Sep 3 11:49:31 fafoy systemd[1]: Starting NVIDIA Persistence Daemon…
Sep 3 11:49:31 fafoy nvidia-persistenced: Verbose syslog connection opened
Sep 3 11:49:31 fafoy nvidia-persistenced: Now running with user ID 112 and group ID 113
Sep 3 11:49:31 fafoy nvidia-persistenced: Started (1739)
Sep 3 11:49:31 fafoy nvidia-persistenced: Received signal 15
Sep 3 11:49:31 fafoy nvidia-persistenced: PID file unlocked.
Sep 3 11:49:31 fafoy nvidia-persistenced: PID file closed.
Sep 3 11:49:31 fafoy nvidia-persistenced: The daemon no longer has permission to remove its runtime data directory /var/run/nvidia-persistenced
Sep 3 11:49:31 fafoy nvidia-persistenced: Shutdown (1728)
Sep 3 11:49:32 fafoy nvidia-persistenced: device 0000:01:00.0 - registered
Sep 3 11:49:32 fafoy nvidia-persistenced: Local RPC services initialized
Sep 3 11:49:32 fafoy systemd[1]: Started NVIDIA Persistence Daemon.

So I traced nvidia-persistenced:

fafoy@fafoy:~$ sudo service nvidia-persistenced status
● nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; static; vend
Active: active (running) since Tue 2019-09-03 11:58:23 CST; 35s ago
Process: 11160 ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced (code=exi
Process: 11164 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenc
Main PID: 11165 (nvidia-persiste)
Tasks: 1 (limit: 4915)
CGroup: /system.slice/nvidia-persistenced.service
└─11165 /usr/bin/nvidia-persistenced --user nvidia-persistenced --no-

Sep 03 11:58:23 fafoy systemd[1]: Starting NVIDIA Persistence Daemon…
Sep 03 11:58:23 fafoy nvidia-persistenced[11165]: Verbose syslog connection open
Sep 03 11:58:23 fafoy nvidia-persistenced[11165]: Now running with user ID 112 a
Sep 03 11:58:23 fafoy nvidia-persistenced[11165]: Started (11165)
Sep 03 11:58:23 fafoy nvidia-persistenced[11165]: device 0000:01:00.0 - register
Sep 03 11:58:23 fafoy nvidia-persistenced[11165]: Local RPC services initialized
Sep 03 11:58:23 fafoy systemd[1]: Started NVIDIA Persistence Daemon.

Then I stopped it:

fafoy@fafoy:~$ sudo service nvidia-persistenced stop

fafoy@fafoy:~$ sudo service nvidia-persistenced status
● nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; static; vend
Active: inactive (dead)

Sep 03 11:58:23 fafoy nvidia-persistenced[11165]: Local RPC services initialized
Sep 03 11:58:23 fafoy systemd[1]: Started NVIDIA Persistence Daemon.
Sep 03 11:59:03 fafoy nvidia-persistenced[11165]: Received signal 15
Sep 03 11:59:03 fafoy systemd[1]: Stopping NVIDIA Persistence Daemon…
Sep 03 11:59:03 fafoy nvidia-persistenced[11165]: Socket closed.
Sep 03 11:59:03 fafoy nvidia-persistenced[11165]: PID file unlocked.
Sep 03 11:59:03 fafoy nvidia-persistenced[11165]: PID file closed.
Sep 03 11:59:03 fafoy nvidia-persistenced[11165]: The daemon no longer has permi
Sep 03 11:59:03 fafoy nvidia-persistenced[11165]: Shutdown (11165)
Sep 03 11:59:03 fafoy systemd[1]: Stopped NVIDIA Persistence Daemon.

The config:

fafoy@fafoy:~$ cat /lib/systemd/system/nvidia-persistenced.service
[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target

[Service]
Type=forking
ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

I have some questions:

  1. Is --persistence-mode more suitable than --no-persistence-mode for us?
  2. Which is better: modifying the config, or running nvidia-smi -pm 1?

Thanks!

Persistence mode is deprecated; it is only needed on POWER9 systems and some older Teslas, so with your 1070 you should use --no-persistence-mode. People have had problems with the NVIDIA-provided systemd unit before, see this:
https://devtalk.nvidia.com/default/topic/1052054/linux/ffmpeg-cannot-init-cuda-for-transcoding/post/5340967/#5340967
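One common workaround is to extend the shipped unit so it can be enabled at boot, e.g. via `sudo systemctl edit --full nvidia-persistenced` (a sketch: this is your existing unit from above plus an `[Install]` section, not an official NVIDIA file):

```ini
[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target

[Service]
Type=forking
ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

[Install]
WantedBy=multi-user.target
```

After saving, `sudo systemctl daemon-reload && sudo systemctl enable --now nvidia-persistenced` should make it start on every boot.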

Thanks!