Persistence Mode(?) causing the GPU not to be recognized after reboot

I just upgraded my Debian workstation with an RTX 3060 from 11 (bullseye) to 12 (bookworm). My use case is mainly running TensorFlow and PyTorch on this box. For both the NVIDIA drivers and the CUDA Toolkit, I followed the distribution-specific package manager install instructions; basically apt-get for everything.

I followed this Debian wiki page for the drivers and the NVIDIA CUDA installation guide for the CUDA Toolkit, and successfully installed the NVIDIA driver (version 535.x) and the CUDA Toolkit (v12.3) itself.
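For reference, the installs were along these lines (from memory, so take the exact package names as approximate):

# Driver, per the Debian wiki (contrib and non-free-firmware enabled)
sudo apt-get install nvidia-driver firmware-misc-nonfree

# CUDA Toolkit, per the NVIDIA guide, after adding the CUDA apt repo
sudo apt-get install cuda-toolkit-12-3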

When I first hit the problem described below, I went back through the install guides and double-checked that I hadn't missed a step. After the apt-get installs, I had done the following:

  1. Amended my PATH to include /usr/local/cuda-12.3/bin so that the shell can find nvcc and the other utilities (this works fine)
  2. Started and enabled the persistence daemon (nvidia-persistenced.service) via systemctl so that it starts at boot (my drop-in override, shown at the end of this post, runs it with --user=debian)
  3. Cloned the cuda-samples GitHub repo to verify that everything worked. I had to build the samples I wanted from their individual directories, since the top-level make threw errors and refused to compile. (Commands sketched just after this list.)
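Concretely, the post-install steps looked roughly like this (PATH line as in the NVIDIA post-install instructions; sample path as in the current cuda-samples layout):

# 1. PATH for nvcc and friends, appended to ~/.bashrc
export PATH=/usr/local/cuda-12.3/bin${PATH:+:${PATH}}

# 2. Persistence daemon: start now and at every boot
sudo systemctl enable --now nvidia-persistenced.service

# 3. Build individual samples, since the top-level make failed
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/1_Utilities/deviceQuery
make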

I first noticed the issue when I ran the deviceQuery sample from cuda-samples. As I understand the installation guide, this is basically a sanity check that the install has gone off without a hitch. When I ran deviceQuery I got Result = FAIL. The guide says this usually means that either the /dev/nvidia* device files are missing or they have the wrong permissions.
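Roughly what I ran at that point (deviceQuery built as in the sketch above):

./deviceQuery        # last line of output: Result = FAIL
ls -l /dev/nvidia*   # the check the guide points to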

/dev/nvidia0 existed, so I figured it was a case of mismatched ownership, and sure enough the device was owned by root:root instead of something like root:video, which I had seen on a couple of forum pages here. I didn't have a video group, so I created one, added my local user to it, and then used chgrp via sudo to change the group ownership. Sure enough, this worked and I got the expected Result = PASS from deviceQuery. I tested both TensorFlow and PyTorch and found that both could see the GPU.
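The fix at that point was essentially (group name as per those forum threads):

sudo groupadd video              # the group didn't exist on my system
sudo usermod -aG video $USER     # add my login user to it
sudo chgrp video /dev/nvidia0    # hand the device node to the group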

I thought the issue was fixed and called it a day. However, after rebooting the next day, my PyTorch script was slow, and on investigation I found that not only was PyTorch no longer recognizing the GPU (it threw a cuInit error), but the ownership on /dev/nvidia0 had been reset from the root:video I had set back to root:root. I went through a couple of pages here on the forum, and after some google-fu I concluded that it was down to some issue with Persistence Mode, since that apparently affects device permissions.
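The two quick checks I used after the reboot (the one-liner assumes my existing PyTorch environment):

ls -l /dev/nvidia0                                            # back to root:root
python3 -c "import torch; print(torch.cuda.is_available())"   # False, with the cuInit warning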

So I ended up on this dev page, which has more details on Persistence Mode. A couple of other pages here on the forum also mentioned that running the following

su -
nvidia-smi -pm 1

should fix it.

This does indeed work. If I switch to the root user and run nvidia-smi with the -pm flag, it reports that Persistence Mode is already Enabled, yet deviceQuery and PyTorch/TensorFlow only work after this command has been run. I don't have to change groups or anything. Moreover, my understanding is that this is exactly what nvidia-persistenced is designed for; I shouldn't have to manually run a sequence of commands every time I boot the machine.
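I could presumably script around it with a oneshot unit along these lines (untested sketch, unit name made up), but that feels like treating the symptom rather than the cause:

[Unit]
Description=Workaround: force persistence mode after boot
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm 1

[Install]
WantedBy=multi-user.target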

The main problem I'm unable to solve, and that I hope someone here can help with, is this: what am I doing wrong with Persistence Mode that it isn't effective at boot? Is this a Persistence Mode thing in the first place, or a systemd thing? nvidia-smi says Persistence Mode is On, but I still have to run the command as root before the GPU is recognized by either PyTorch or TensorFlow.

Output of systemctl status:

root@debian:~# systemctl status nvidia-persistenced.service 
● nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; enabled; preset: enabled)
    Drop-In: /etc/systemd/system/nvidia-persistenced.service.d
             └─override.conf
     Active: active (running) since Fri 2024-08-02 11:47:20 IST; 10h ago
    Process: 794 ExecStart=/usr/bin/nvidia-persistenced --persistence-mode --user=debian (code=exited, status=0/SUCCESS)
   Main PID: 828 (nvidia-persiste)
      Tasks: 1 (limit: 38180)
     Memory: 2.9M
        CPU: 911ms
     CGroup: /system.slice/nvidia-persistenced.service
             └─828 /usr/bin/nvidia-persistenced --persistence-mode --user=debian

Aug 02 11:47:18 debian systemd[1]: Starting nvidia-persistenced.service - NVIDIA Persistence Daemon...
Aug 02 11:47:18 debian nvidia-persistenced[828]: Started (828)
Aug 02 11:47:20 debian systemd[1]: Started nvidia-persistenced.service - NVIDIA Persistence Daemon.

The drop-in override for the unit (created as root with systemctl edit nvidia-persistenced.service):

[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target

[Service]
Type=forking
ExecStart=
ExecStart=/usr/bin/nvidia-persistenced --persistence-mode --user=debian
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

[Install]
WantedBy=multi-user.target
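
For completeness, this is how I've been checking the state after a reboot (assuming persistence_mode is a valid query field on this driver version):

nvidia-smi --query-gpu=persistence_mode --format=csv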

I'm quite at my wits' end with this, since as far as I can tell I've followed the install guides properly and still have the issue. I'd really appreciate it if someone could help me figure out what's wrong.