Nvidia-persistenced: Failed to query NVIDIA devices

I am trying to install the CUDA driver and CUDA toolkit on Ubuntu 20.04, following the instructions from the NVIDIA CUDA Installation Guide for Linux.

When I run /usr/bin/nvidia-persistenced --verbose, I get:
nvidia-persistenced failed to initialize. Check syslog for more details.

Then the syslog shows this:
Aug 2 07:25:41 srs-uav nvidia-persistenced: Verbose syslog connection opened
Aug 2 07:25:41 srs-uav nvidia-persistenced: Started (33449)
Aug 2 07:25:41 srs-uav nvidia-persistenced: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.
Aug 2 07:25:41 srs-uav nvidia-persistenced: PID file unlocked.
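For reference, this is how the check that message asks for can be done (standard commands; if the kernel module never loaded, the device files may simply not exist):

$ ls -l /dev/nvidia*      # device nodes should exist and be read/writable by root (user 0)
$ lsmod | grep nvidia     # the nvidia kernel module should show up here if the driver loaded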

I ran a bug report and the output is attached. I would appreciate any help with this.
Thank you
nvidia-bug-report.log.gz (154.6 KB)

Please provide the list of commands you ran to install these packages on your system. Also, please paste the output of nvidia-smi.

I have checked your error log and suspect you may have installed a driver that only supports older NVIDIA products (see also the quick check suggested after the list below):

[437441.727] GeForce 256 (NV10)
[437441.727] GeForce 2 (NV11, NV15)
[437441.727] GeForce 4MX (NV17, NV18)
[437441.727] GeForce 3 (NV20)
[437441.728] GeForce 4Ti (NV25, NV28)
[437441.728] GeForce FX (NV3x)
[437441.728] GeForce 6 (NV4x)
[437441.728] GeForce 7 (G7x)
[437441.728] GeForce 8 (G8x)
[437441.728] GeForce 9 (G9x)
[437441.728] GeForce GTX 2xx/3xx (GT2xx)
[437441.728] GeForce GTX 4xx/5xx (GFxxx)
[437441.728] GeForce GTX 6xx/7xx (GKxxx)
[437441.728] GeForce GTX 9xx (GMxxx)
[437441.728] GeForce GTX 10xx (GPxxx)
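If it helps, a quick way to confirm which driver packages are actually installed and whether the kernel module was built for your kernel (generic commands, adjust to your setup):

$ dpkg -l | grep -E 'nvidia-driver|cuda-drivers'
$ dkms status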

Thank you. Please see below:

$ lspci | grep -i nvidia

0000:65:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)

0000:65:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)

$uname -m && cat /etc/*release

x86_64

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.6 LTS"
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

$ gcc --version

gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

$ uname -r

5.15.0-78-generic

sudo apt-get update

sudo apt-get install linux-headers-$(uname -r)

sudo apt-key del 7fa2af80

sudo dpkg -i cuda-keyring_1.1-1_all.deb

sudo apt install libnvidia-common-535

Below, the dependencies were added based on the error messages:

sudo ubuntu-drivers autoinstall

sudo apt install nvidia-driver-535

sudo apt install nvidia-driver-535 nvidia-dkms-535 nvidia-kernel-source-535 nvidia-kernel-open-535 libnvidia-compute-535:i386 libnvidia-extra-535 nvidia-compute-utils-535 nvidia-compute-utils-535 libnvidia-decode-535:i386 libnvidia-encode-535:i386 libnvidia-fbc1-535:i386 nvidia-utils-535 xserver-xorg-video-nvidia-535

The above didn’t install, so I proceeded as follows:

sudo apt-get -y install cuda-drivers

sudo apt-get -y install cuda-toolkit-12-2 cuda-drivers-535 --verbose-versions

sudo apt-get -y install cuda-toolkit-12-2 cuda-drivers-535 nvidia-dkms-535 nvidia-driver-535 nvidia-kernel-common-535 --verbose-versions

sudo apt-get -y install cuda-toolkit-12-2 cuda-drivers-535 nvidia-dkms-535 nvidia-driver-535 nvidia-kernel-common-535 libnvidia-extra-535 --verbose-versions

sudo reboot
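For completeness, a couple of checks that can confirm whether those packages actually produced a loadable kernel module for the running kernel (standard commands; shown as a sketch, I don't have the output captured here):

$ dkms status | grep nvidia    # should show the nvidia module as installed for 5.15.0-78-generic
$ lsmod | grep nvidia          # empty output means the kernel module is not loaded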

$ nvidia-smi

NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
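When nvidia-smi reports this, the kernel log usually says why the module failed to load. A hedged example of where to look (generic commands, nothing specific to this machine):

$ dmesg | grep -iE 'nvrm|nvidia'
$ journalctl -k -b | grep -iE 'nvidia|signature|lockdown'    # signature/lockdown errors here would point at Secure Boot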

Also, below is the recommended driver:

$ sudo ubuntu-drivers devices

vendor : NVIDIA Corporation

I tried to resolve this issue in my lab and hope it helps you resolve yours.

  1. Enable persistence mode globally

root@gpuserver:/home/haitao# nvidia-smi -pm 1
Enabled persistence mode for GPU 00000000:81:00.0.
All done.

root@gpuserver:/home/haitao# nvidia-smi
Tue Aug  8 08:51:28 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05              Driver Version: 535.86.05    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H800 PCIe               On  | 00000000:81:00.0 Off |                    0 |
| N/A   28C    P0              44W / 350W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
  2. Check the current process list and terminate the session
root@gpuserver:/home/haitao# ps aux | grep persistenced
nvidia-+ 1059454  0.0  0.0   5320  1808 ?        Ss   08:52   0:00 /usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose
root     1059473  0.0  0.0   6608  2260 pts/2    S+   08:55   0:00 grep --color=auto persistenced
root@gpuserver:/home/haitao# kill -9 1059454
root@gpuserver:/home/haitao# ps aux | grep persistenced
root     1059477  0.0  0.0   6608  2312 pts/2    S+   08:56   0:00 grep --color=auto persistenced
  3. Try to run nvidia-persistenced again

root@gpuserver:/home/haitao# /usr/bin/nvidia-persistenced --verbose

  4. Check the syslog to validate the status
Aug  8 09:03:16 gpuserver nvidia-persistenced: Verbose syslog connection opened
Aug  8 09:03:16 gpuserver nvidia-persistenced: Directory /var/run/nvidia-persistenced will not be removed on exit
Aug  8 09:03:16 gpuserver nvidia-persistenced: Started (1059528)
Aug  8 09:03:16 gpuserver nvidia-persistenced: device 0000:81:00.0 - registered
Aug  8 09:03:16 gpuserver nvidia-persistenced: device 0000:81:00.0 - persistence mode enabled.
Aug  8 09:03:16 gpuserver nvidia-persistenced: device 0000:81:00.0 - NUMA memory onlined.
Aug  8 09:03:16 gpuserver nvidia-persistenced: Local RPC services initialized
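For what it's worth, the same sequence can usually be done through systemd instead of kill -9, which keeps the unit state consistent (a suggestion on top of the steps above, not something I ran here):

# systemctl restart nvidia-persistenced.service
# journalctl -u nvidia-persistenced.service -n 20    # same messages as the syslog check above
# nvidia-smi -pm 1                                   # re-enable persistence mode if the unit starts with --no-persistence-mode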

In the end, I think the problem is resolved, at least temporarily. There are 3 points outstanding here:

  1. In this state, the service nvidia-persistenced.service shows the wrong status:
root@gpuserver:/home/haitao# systemctl status nvidia-persistenced.service
× nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; static)
     Active: failed (Result: signal) since Tue 2023-08-08 08:56:00 UTC; 7min ago
    Process: 1059453 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose (code=exited, status=0/SUCCESS)
    Process: 1059474 ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced (code=exited, status=0/SUCCESS)
   Main PID: 1059454 (code=killed, signal=KILL)
        CPU: 7ms


Aug 08 08:52:32 gpuserver systemd[1]: Starting NVIDIA Persistence Daemon...
Aug 08 08:52:32 gpuserver nvidia-persistenced[1059454]: Verbose syslog connection opened
Aug 08 08:52:32 gpuserver nvidia-persistenced[1059454]: Now running with user ID 113 and group ID 118
Aug 08 08:52:32 gpuserver nvidia-persistenced[1059454]: Started (1059454)
Aug 08 08:52:32 gpuserver nvidia-persistenced[1059454]: device 0000:81:00.0 - registered
Aug 08 08:52:32 gpuserver nvidia-persistenced[1059454]: Local RPC services initialized
Aug 08 08:52:32 gpuserver systemd[1]: Started NVIDIA Persistence Daemon.
Aug 08 08:56:00 gpuserver systemd[1]: nvidia-persistenced.service: Main process exited, code=killed, status=9/KILL
Aug 08 08:56:00 gpuserver systemd[1]: nvidia-persistenced.service: Failed with result 'signal'.
  2. The verbose run can't be repeated unless you go back to step 2 and kill the current process.

  3. I'm not an expert on this component of the NVIDIA driver. If my explanation doesn't address your core problem, please point that out and I will try to pull in more senior resources to look at this issue.

My guess is that nvidia-persistenced should work fine as long as the nvidia-persistenced.service unit is OK. Please try another way to check whether persistence works as you expect.
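One such check, for example (a standard nvidia-smi query, nothing specific to this box):

# nvidia-smi --query-gpu=name,persistence_mode --format=csv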

Below is the output of systemctl. Please note the last 5 lines; they are similar to step 4 above.

root@gpuserver:/home/haitao# systemctl restart nvidia-persistenced.service
root@gpuserver:/home/haitao# systemctl status nvidia-persistenced.service
● nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; static)
     Active: active (running) since Fri 2023-08-04 10:31:47 UTC; 3 days ago
   Main PID: 1140 (nvidia-persiste)
      Tasks: 1 (limit: 38213)
     Memory: 984.0K
        CPU: 80ms
     CGroup: /system.slice/nvidia-persistenced.service
             └─1140 /usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose

Aug 04 10:31:46 gpuserver systemd[1]: Starting NVIDIA Persistence Daemon...
Aug 04 10:31:46 gpuserver nvidia-persistenced[1140]: Verbose syslog connection opened
Aug 04 10:31:46 gpuserver nvidia-persistenced[1140]: Now running with user ID 113 and group ID 118
Aug 04 10:31:46 gpuserver nvidia-persistenced[1140]: Started (1140)
Aug 04 10:31:47 gpuserver nvidia-persistenced[1140]: device 0000:81:00.0 - registered
Aug 04 10:31:47 gpuserver nvidia-persistenced[1140]: Local RPC services initialized
Aug 04 10:31:47 gpuserver systemd[1]: Started NVIDIA Persistence Daemon.
Aug 08 08:40:36 gpuserver nvidia-persistenced[1140]: device 0000:81:00.0 - persistence mode enabled.
Aug 08 08:40:36 gpuserver nvidia-persistenced[1140]: device 0000:81:00.0 - NUMA memory onlined.

Regards

I did try early on to globally enable persistence mode, but that didn't help.
This is what I have now:

nvidia-smi -pm 1

NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

All the subsequent steps you executed were also done by me early on, but with negative results.

This was a Secure Boot issue. I work remotely and was installing remotely. Given that this is a desktop computer, I believe the Secure Boot part of the installation should have been done onsite, because we were supposed to confirm the Secure Boot password after reboot, but that did not happen. My apologies for not realizing that earlier.
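For anyone hitting the same thing: with Secure Boot enabled, the DKMS-built nvidia module has to be signed with a key the firmware trusts, and the MOK enrollment prompt only appears on the local console at the next boot, so it cannot be completed over SSH. A rough sketch of the checks involved (the MOK.der path is the usual Ubuntu location and is an assumption about this install):

$ mokutil --sb-state                                       # shows whether Secure Boot is enabled
$ sudo mokutil --import /var/lib/shim-signed/mok/MOK.der   # queue the signing key, then confirm the password at the console after reboot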


That's good news. I'm sorry I couldn't reproduce the issue before you found the root cause, since Secure Boot is disabled by default in the BIOS in my lab.

I am going to close this case. Thanks for your contribution on this topic; it has been archived in our internal KB.

Thank you for your help!