Persistence Mode(?) causing the GPU not to be recognized after reboot

I just upgraded my Debian workstation with an RTX 3060 from 11 (bullseye) to 12 (bookworm). My use case is mainly running TensorFlow and PyTorch on this box. For both the NVIDIA drivers and the CUDA Toolkit, I followed the distribution-specific package manager install instructions; basically apt-get for everything.

I followed this Debian wiki page for the drivers and the NVIDIA CUDA installation guide for the CUDA Toolkit, and successfully installed the NVIDIA driver (version 535.x) and the CUDA Toolkit (v12.3) itself.
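For reference, the installs were along these lines (from memory, so take the exact package names as approximate):

# Driver, per the Debian wiki (contrib and non-free-firmware enabled)
sudo apt-get install nvidia-driver firmware-misc-nonfree

# CUDA Toolkit, per the NVIDIA guide, after adding the CUDA apt repo
sudo apt-get install cuda-toolkit-12-3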

When I first hit the problem described below, I went back through the install guides and double-checked that I hadn't missed a step. After the apt-get installs, I had done the following:

  1. Amended my PATH to include /usr/local/cuda-12.3/bin so that the shell can find nvcc and the other utilities (this works fine)
  2. Started and enabled the persistence daemon (nvidia-persistenced.service) via systemctl so that it starts at boot (my drop-in override, shown at the end of this post, runs it with --user=debian)
  3. Cloned the cuda-samples GitHub repo to verify that everything worked. I had to build the samples I wanted from their individual directories, since the top-level make threw errors and refused to compile. (Commands sketched just after this list.)
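Concretely, the post-install steps looked roughly like this (PATH line as in the NVIDIA post-install instructions; sample path as in the current cuda-samples layout):

# 1. PATH for nvcc and friends, appended to ~/.bashrc
export PATH=/usr/local/cuda-12.3/bin${PATH:+:${PATH}}

# 2. Persistence daemon: start now and at every boot
sudo systemctl enable --now nvidia-persistenced.service

# 3. Build individual samples, since the top-level make failed
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/1_Utilities/deviceQuery
make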

I first noticed the issue when I ran the deviceQuery sample from cuda-samples. As I understand the installation guide, this is basically a sanity check that the install has gone off without a hitch. When I ran deviceQuery I got Result = FAIL. The guide says this usually means that either the /dev/nvidia* device files are missing or they have the wrong permissions.
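Roughly what I ran at that point (deviceQuery built as in the sketch above):

./deviceQuery        # last line of output: Result = FAIL
ls -l /dev/nvidia*   # the check the guide points to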

/dev/nvidia0 existed, so I figured it was a case of mismatched ownership, and sure enough the device was owned by root:root instead of something like root:video, which I had seen on a couple of forum pages here. I didn't have a video group, so I created one, added my local user to it, and then used chgrp via sudo to change the group ownership. Sure enough, this worked and I got the expected Result = PASS from deviceQuery. I tested both TensorFlow and PyTorch and found that both could see the GPU.
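The fix at that point was essentially (group name as per those forum threads):

sudo groupadd video              # the group didn't exist on my system
sudo usermod -aG video $USER     # add my login user to it
sudo chgrp video /dev/nvidia0    # hand the device node to the group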

I thought the issue was fixed and called it a day. However, after rebooting the next day, my PyTorch script was slow, and on investigation I found that not only was PyTorch no longer recognizing the GPU (it threw a cuInit error), but the ownership on /dev/nvidia0 had been reset from the root:video I had set back to root:root. I went through a couple of pages here on the forum, and after some google-fu I concluded that it was down to some issue with Persistence Mode, since that apparently affects device permissions.
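The two quick checks I used after the reboot (the one-liner assumes my existing PyTorch environment):

ls -l /dev/nvidia0                                            # back to root:root
python3 -c "import torch; print(torch.cuda.is_available())"   # False, with the cuInit warning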

So I ended up on this dev page, which has more details on Persistence Mode. A couple of other pages here on the forum also mentioned that running the following

su -
nvidia-smi -pm 1

should fix it.

This does indeed work. If I switch to the root user and run nvidia-smi with the -pm flag, it reports that Persistence Mode is already Enabled, yet deviceQuery and PyTorch/TensorFlow only work after this command has been run. I don't have to change groups or anything. Moreover, my understanding is that this is exactly what nvidia-persistenced is designed for; I shouldn't have to manually run a sequence of commands every time I boot the machine.
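I could presumably script around it with a oneshot unit along these lines (untested sketch, unit name made up), but that feels like treating the symptom rather than the cause:

[Unit]
Description=Workaround: force persistence mode after boot
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm 1

[Install]
WantedBy=multi-user.target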

The main problem I'm unable to solve, and that I hope someone here can help with, is this: what am I doing wrong with Persistence Mode that it isn't effective at boot? Is this a Persistence Mode thing in the first place, or a systemd thing? nvidia-smi says Persistence Mode is On, but I still have to run the command as root before the GPU is recognized by either PyTorch or TensorFlow.

Output of systemctl status:

root@debian:~# systemctl status nvidia-persistenced.service 
● nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; enabled; preset: enabled)
    Drop-In: /etc/systemd/system/nvidia-persistenced.service.d
             └─override.conf
     Active: active (running) since Fri 2024-08-02 11:47:20 IST; 10h ago
    Process: 794 ExecStart=/usr/bin/nvidia-persistenced --persistence-mode --user=debian (code=exited, status=0/SUCCESS)
   Main PID: 828 (nvidia-persiste)
      Tasks: 1 (limit: 38180)
     Memory: 2.9M
        CPU: 911ms
     CGroup: /system.slice/nvidia-persistenced.service
             └─828 /usr/bin/nvidia-persistenced --persistence-mode --user=debian

Aug 02 11:47:18 debian systemd[1]: Starting nvidia-persistenced.service - NVIDIA Persistence Daemon...
Aug 02 11:47:18 debian nvidia-persistenced[828]: Started (828)
Aug 02 11:47:20 debian systemd[1]: Started nvidia-persistenced.service - NVIDIA Persistence Daemon.

The drop-in override for the unit (created as root with systemctl edit nvidia-persistenced.service):

[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target

[Service]
Type=forking
ExecStart=
ExecStart=/usr/bin/nvidia-persistenced --persistence-mode --user=debian
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

[Install]
WantedBy=multi-user.target
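
For completeness, this is how I've been checking the state after a reboot (assuming persistence_mode is a valid query field on this driver version):

nvidia-smi --query-gpu=persistence_mode --format=csv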

I'm quite at my wits' end with this, since as far as I can tell I've followed the install guides properly and still have the issue. I'd really appreciate it if someone could help me figure out what's wrong.