Nvidia Driver/GPU fails to detect display/monitor / Freezing at "NVIDIA persistence daemon" - Debian RTX a4000

Hello all,

I’ll begin by stating that I don’t know what the “actual” problem is and I’m not very experienced in Linux, so I’ll try to provide as much information as possible but will probably miss something.

I have a machine I’m setting up for machine learning. It’s running an AMD 1600X and an RTX A4000.

The issue: the computer boots through the BIOS, lets me select the OS, and Linux begins starting the runlevel services. When it gets to the “NVIDIA persistence daemon”, the screen “freezes”. Secure Boot is not enabled. The computer still works fine when I SSH in. It’s an older AMD chip, so there is no integrated graphics.

Nouveau should be disabled: I blacklisted it by following “Install Nvidia Drivers on Debian/Ubuntu | Kinetica Docs”, and confirmed it by running the command

sudo lsmod | grep nouveau

which returns blank
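
An empty grep result is easy to misread, so a more explicit check may help here. The following is only a sketch: `module_loaded` and `report_module` are hypothetical helper names, and the check reads /proc/modules, which is the same kernel list that lsmod formats.

```shell
#!/bin/sh
# module_loaded NAME [FILE]
# Succeeds if NAME is listed as a loaded module in FILE (default: /proc/modules).
# The optional FILE argument is an assumption added only to make testing easy.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# report_module NAME [FILE]
# Print a human-readable status line for NAME.
report_module() {
    if module_loaded "$1" "$2"; then
        echo "$1: loaded"
    else
        echo "$1: not loaded"
    fi
}

# Example usage on the affected machine:
#   report_module nouveau
#   report_module nvidia
```

On a machine in the state described above, the expectation would be that nouveau is reported as not loaded and nvidia as loaded.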

The drivers are installed correctly as best I can tell. My APT sources look like this:

deb http://deb.debian.org/debian/ bookworm main contrib non-free non-free-firmware
deb-src http://deb.debian.org/debian/ bookworm main contrib non-free non-free-firmware

deb http://security.debian.org/debian-security bookworm-security main contrib non-free non-free-firmware
deb-src http://security.debian.org/debian-security bookworm-security main contrib non-free non-free-firmware

# bookworm-updates, to get updates before a point release is made;
# see https://www.debian.org/doc/manuals/debian-reference/ch02.en.html#_updates_and_backports

deb http://deb.debian.org/debian/ bookworm-updates main contrib non-free non-free-firmware
deb-src http://deb.debian.org/debian/ bookworm-updates main contrib non-free non-free-firmware

# This system was installed using small removable media
# (e.g. netinst, live or single CD). The matching "deb cdrom"
# entries were disabled at the end of the installation process.
# For information about how to configure apt package sources,
# see the sources.list(5) manual.

The system recognizes the hardware, as shown by running
lspci | grep -i "nvidia"

1c:00.0 VGA compatible controller: NVIDIA Corporation GA104GL [RTX A4000] (rev a1)
1c:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)

Packages found with
dpkg -l | grep -i nvidia

ii  firmware-nvidia-gsp                     525.125.06-1~deb12u1                amd64        NVIDIA GSP firmware
ii  glx-alternative-nvidia                  1.2.2                               amd64        allows the selection of NVIDIA as GLX provider 
ii  libcuda1:amd64                          525.125.06-1~deb12u1                amd64        NVIDIA CUDA Driver Library 
ii  libegl-nvidia0:amd64                    525.125.06-1~deb12u1                amd64        NVIDIA binary EGL library 
ii  libgl1-nvidia-glvnd-glx:amd64           525.125.06-1~deb12u1                amd64        NVIDIA binary OpenGL/GLX library (GLVND variant) 
ii  libgles-nvidia1:amd64                   525.125.06-1~deb12u1                amd64        NVIDIA binary OpenGL|ES 1.x library 
ii  libgles-nvidia2:amd64                   525.125.06-1~deb12u1                amd64        NVIDIA binary OpenGL|ES 2.x library 
ii  libglx-nvidia0:amd64                    525.125.06-1~deb12u1                amd64        NVIDIA binary GLX library 
ii  libnvcuvid1:amd64                       525.125.06-1~deb12u1                amd64        NVIDIA CUDA Video Decoder runtime library 
ii  libnvidia-allocator1:amd64              525.125.06-1~deb12u1                amd64        NVIDIA allocator runtime library 
ii  libnvidia-cfg1:amd64                    525.125.06-1~deb12u1                amd64        NVIDIA binary OpenGL/GLX configuration library 
ii  libnvidia-egl-gbm1:amd64                1.1.0-2                             amd64        GBM EGL external platform library for NVIDIA 
ii  libnvidia-egl-wayland1:amd64            1:1.1.10-1                          amd64        Wayland EGL External Platform library -- shared library 
ii  libnvidia-eglcore:amd64                 525.125.06-1~deb12u1                amd64        NVIDIA binary EGL core libraries 
ii  libnvidia-encode1:amd64                 525.125.06-1~deb12u1                amd64        NVENC Video Encoding runtime library 
ii  libnvidia-glcore:amd64                  525.125.06-1~deb12u1                amd64        NVIDIA binary OpenGL/GLX core libraries 
ii  libnvidia-glvkspirv:amd64               525.125.06-1~deb12u1                amd64        NVIDIA binary Vulkan Spir-V compiler library 
ii  libnvidia-ml1:amd64                     525.125.06-1~deb12u1                amd64        NVIDIA Management Library (NVML) runtime library 
ii  libnvidia-ptxjitcompiler1:amd64         525.125.06-1~deb12u1                amd64        NVIDIA PTX JIT Compiler library 
ii  libnvidia-rtcore:amd64                  525.125.06-1~deb12u1                amd64        NVIDIA binary Vulkan ray tracing (rtcore) library
ii  nvidia-alternative                      525.125.06-1~deb12u1                amd64        allows the selection of NVIDIA as GLX provider
ii  nvidia-driver                           525.125.06-1~deb12u1                amd64        NVIDIA metapackage
ii  nvidia-driver-bin                       525.125.06-1~deb12u1                amd64        NVIDIA driver support binaries
ii  nvidia-driver-libs:amd64                525.125.06-1~deb12u1                amd64        NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries) 
ii  nvidia-egl-common                       525.125.06-1~deb12u1                amd64        NVIDIA binary EGL driver - common files 
ii  nvidia-egl-icd:amd64                    525.125.06-1~deb12u1                amd64        NVIDIA EGL installable client driver (ICD) 
ii  nvidia-installer-cleanup                20220217+3~deb12u1                  amd64        cleanup after driver installation with the nvidia-installer 
ii  nvidia-kernel-common                    20220217+3~deb12u1                  amd64        NVIDIA binary kernel module support files 
ii  nvidia-kernel-dkms                      525.125.06-1~deb12u1                amd64        NVIDIA binary kernel module DKMS source 
ii  nvidia-kernel-support                   525.125.06-1~deb12u1                amd64        NVIDIA binary kernel module support files 
ii  nvidia-legacy-check                     525.125.06-1~deb12u1                amd64        check for NVIDIA GPUs requiring a legacy driver 
ii  nvidia-modprobe                         535.54.03-1~deb12u1                 amd64        utility to load NVIDIA kernel modules and create device nodes 
ii  nvidia-persistenced                     525.85.05-1                         amd64        daemon to maintain persistent software state in the NVIDIA driver 
ii  nvidia-settings                         525.125.06-1~deb12u1                amd64        tool for configuring the NVIDIA graphics driver 
ii  nvidia-smi                              525.125.06-1~deb12u1                amd64        NVIDIA System Management Interface 
ii  nvidia-support                          20220217+3~deb12u1                  amd64        NVIDIA binary graphics driver support files 
ii  nvidia-vdpau-driver:amd64               525.125.06-1~deb12u1                amd64        Video Decode and Presentation API for Unix - NVIDIA driver 
ii  nvidia-vulkan-common                    525.125.06-1~deb12u1                amd64        NVIDIA Vulkan driver - common files 
ii  nvidia-vulkan-icd:amd64                 525.125.06-1~deb12u1                amd64        NVIDIA Vulkan installable client driver (ICD) 
ii  xserver-xorg-video-nvidia               525.125.06-1~deb12u1                amd64        NVIDIA binary Xorg driver

And to confirm that everything is installed correctly, I can run:
nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     | 
|-------------------------------+----------------------+----------------------+ 
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC | 
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. | 
|                               |                      |               MIG M. | 
|===============================+======================+======================| 
|   0  NVIDIA RTX A4000    On   | 00000000:1C:00.0 Off |                  Off | 
| 41%   40C    P8    18W / 140W |      1MiB / 16376MiB |      0%      Default | 
|                               |                      |                  N/A | 
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+ 
| Processes:                                                                  | 
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory | 
|        ID   ID                                                   Usage      | 
|=============================================================================| 
|  No running processes found                                                 | 
+-----------------------------------------------------------------------------+

I’ve uninstalled and reinstalled Nvidia drivers.

I noticed that Disp.A = Off, so for some reason the card is not detecting the monitor even though it is plugged in. My first thought was to force a resolution, but setting GRUB_GFXMODE=640x480 and running update-grub did not change the problem.

I also attempted to switch to tty2 using ALT-F2 (or CTRL-ALT-F2), but the computer was unresponsive. I pressed CTRL-ALT-DEL to ensure the computer was receiving keyboard input, and it restarted as expected.

And for those who are interested,

inxi -Fxxxzra

System:
  Kernel: 6.1.0-13-amd64 arch: x86_64 bits: 64 compiler: gcc v: 12.2.0
    parameters: BOOT_IMAGE=/boot/vmlinuz-6.1.0-13-amd64
    root=UUID=8ae679d5-60b9-4d97-aead-8dc82d1bd400 ro quiet rd.driver.blacklist=grub.nouveau
    rcutree.rcu_idle_gp_delay=1 quiet nouveau.modeset=0
  Console: pty pts/0 DM: GDM3 v: 43.0 Distro: Debian GNU/Linux 12 (bookworm)
Machine:
  Type: Desktop Mobo: Micro-Star model: B350 TOMAHAWK (MS-7A34) v: 1.0
    serial: <superuser required> UEFI-[Legacy]: American Megatrends v: 1.M0 date: 01/23/2019
CPU:
  Info: model: AMD Ryzen 5 1600 bits: 64 type: MT MCP arch: Zen level: v3 note: check
    built: 2017-19 process: GF 14nm family: 0x17 (23) model-id: 1 stepping: 1 microcode: 0x8001137
  Topology: cpus: 1x cores: 6 tpc: 2 threads: 12 smt: enabled cache: L1: 576 KiB
    desc: d-6x32 KiB; i-6x64 KiB L2: 3 MiB desc: 6x512 KiB L3: 16 MiB desc: 2x8 MiB
  Speed (MHz): avg: 1654 high: 2800 min/max: 1550/3200 boost: enabled scaling:
    driver: acpi-cpufreq governor: schedutil cores: 1: 1550 2: 2800 3: 1550 4: 1550 5: 1550 6: 1550
    7: 1550 8: 1550 9: 1550 10: 1550 11: 1550 12: 1550 bogomips: 76791
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
  Vulnerabilities:
  Type: gather_data_sampling status: Not affected
  Type: itlb_multihit status: Not affected
  Type: l1tf status: Not affected
  Type: mds status: Not affected
  Type: meltdown status: Not affected
  Type: mmio_stale_data status: Not affected
  Type: retbleed mitigation: untrained return thunk; SMT vulnerable
  Type: spec_rstack_overflow mitigation: safe RET
  Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via prctl
  Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer sanitization
  Type: spectre_v2 mitigation: Retpolines, IBPB: conditional, STIBP: disabled, RSB filling,
    PBRSB-eIBRS: Not affected
  Type: srbds status: Not affected
  Type: tsx_async_abort status: Not affected
Graphics:
  Device-1: NVIDIA GA104GL [RTX A4000] vendor: Lenovo driver: nvidia v: 525.125.06
    non-free: 530.xx+ status: current (as of 2023-03) arch: Ampere code: GAxxx
    process: TSMC n7 (7nm) built: 2020-22 pcie: gen: 1 speed: 2.5 GT/s lanes: 16 link-max: gen: 4
    speed: 16 GT/s bus-ID: 1c:00.0 chip-ID: 10de:24b0 class-ID: 0300
  Display: server: X.org v: 1.21.1.7 with: Xwayland v: 22.1.9 driver: N/A note: X driver n/a
    tty: 157x85
  API: OpenGL Message: GL data unavailable in console. Try -G --display
Audio:
  Device-1: NVIDIA GA104 High Definition Audio vendor: Lenovo driver: snd_hda_intel v: kernel
    pcie: gen: 1 speed: 2.5 GT/s lanes: 16 link-max: gen: 4 speed: 16 GT/s bus-ID: 1c:00.1
    chip-ID: 10de:228b class-ID: 0403
  Device-2: AMD Family 17h HD Audio vendor: Micro-Star MSI driver: snd_hda_intel v: kernel pcie:
    gen: 3 speed: 8 GT/s lanes: 16 bus-ID: 1e:00.3 chip-ID: 1022:1457 class-ID: 0403
  API: ALSA v: k6.1.0-13-amd64 status: kernel-api tools: alsamixer,amixer
  Server-1: PipeWire v: 0.3.65 status: active with: 1: pipewire-pulse status: active
    2: wireplumber status: active 3: pipewire-alsa type: plugin tools: pw-cat,pw-cli,wpctl
Network:
  Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet vendor: Micro-Star MSI
    driver: r8169 v: kernel pcie: gen: 1 speed: 2.5 GT/s lanes: 1 port: f000 bus-ID: 19:00.0
    chip-ID: 10ec:8168 class-ID: 0200
  IF: enp25s0 state: up speed: 100 Mbps duplex: full mac: <filter>
  IF-ID-1: docker0 state: down mac: <filter>
Drives:
  Local Storage: total: 931.51 GiB used: 7.32 GiB (0.8%)
  SMART Message: Required tool smartctl not installed. Check --recommends
  ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Samsung model: SSD 980 1TB size: 931.51 GiB
    block-size: physical: 512 B logical: 512 B speed: 31.6 Gb/s lanes: 4 type: SSD serial: <filter>
    rev: 3B4QFXO7 temp: 38.9 C scheme: MBR
Partition:
  ID-1: / raw-size: 930.56 GiB size: 914.88 GiB (98.31%) used: 7.32 GiB (0.8%) fs: ext4
    dev: /dev/nvme0n1p1 maj-min: 259:1
Swap:
  Kernel: swappiness: 60 (default) cache-pressure: 100 (default)
  ID-1: swap-1 type: partition size: 976 MiB used: 0 KiB (0.0%) priority: -2 dev: /dev/nvme0n1p5
    maj-min: 259:3
Sensors:
  System Temperatures: cpu: 38.9 C mobo: N/A gpu: nvidia temp: 41 C
  Fan Speeds (RPM): N/A
Repos:
  Packages: pm: dpkg pkgs: 1746 libs: 1005 tools: apt,apt-get,gnome-software,synaptic
  Active apt repos in: /etc/apt/sources.list
    1: deb http://deb.debian.org/debian/ bookworm main contrib non-free non-free-firmware
    2: deb-src http://deb.debian.org/debian/ bookworm main contrib non-free non-free-firmware
    3: deb http://security.debian.org/debian-security bookworm-security main contrib non-free non-free-firmware
    4: deb-src http://security.debian.org/debian-security bookworm-security main contrib non-free non-free-firmware
    5: deb http://deb.debian.org/debian/ bookworm-updates main contrib non-free non-free-firmware
    6: deb-src http://deb.debian.org/debian/ bookworm-updates main contrib non-free non-free-firmware
  No active apt repos in: /etc/apt/sources.list.d/docker.list
Info:
  Processes: 216 Uptime: 31m wakeups: 0 Memory: 31.29 GiB used: 958.2 MiB (3.0%) Init: systemd
  v: 252 target: multi-user (3) default: multi-user tool: systemctl Compilers: gcc: 12.2.0 alt: 12
  Shell: Bash v: 5.2.15 running-in: pty pts/0 (SSH) inxi: 3.3.26

Thank you kindly
nvidia-bug-report.log.gz (314.4 KB)
systemd.txt (91.3 KB)

Luckily SSH still works, so please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Sorry, I thought I did but it didn’t save, apparently.

File has been uploaded

Looks fine; it’s just that no desktop environment or getty gets started. Please run
sudo journalctl -b0 |grep systemd >systemd.txt
and attach systemd.txt to your post.

This has been completed.

EDIT:
One of these lines is the last thing shown before the screen freezes. Usually it’s “Started nvidia-persistenced.service - NVIDIA Persistence Daemon.”, but sometimes it stops just before or just after.

Nov 07 11:44:30 Oxidian systemd[1]: Started NetworkManager-dispatcher.service - Network Manager Script Dispatcher Service.
Nov 07 11:44:30 Oxidian systemd[1]: Finished NetworkManager-wait-online.service - Network Manager Wait Online.
Nov 07 11:44:31 Oxidian systemd[1]: Started nvidia-persistenced.service - NVIDIA Persistence Daemon.
Nov 07 11:44:34 Oxidian systemd[1]: Finished networking.service - Raise network interfaces.

Looks ok. A getty is started on tty1, you should be able to just hit ‘enter’ to get a login prompt. Gnome seems to be installed but not set to start on boot. What exactly do you expect to happen, a DE starting or just a VT login prompt?

VT login prompt - the DE should be disabled

Ok, so does hitting “enter” give you a login prompt? If not, have you already tried a different keyboard?

Unfortunately, hitting “enter” does nothing. I’ve attempted using ALT-F2 to switch to a different terminal, also nothing. I can confirm that the keyboard works because CTRL+ALT+DEL forces a reboot as expected.

I also double-checked that it wasn’t a problem with the DP to DVI adapter by plugging the card into a monitor with a DP input: no change.

Ok, then this is the case where the nvidia driver for unknown (to me) reasons fails to provide a usable console. Nothing is freezing, just the text output is broken.
Things to try:

  • set kernel parameter nvidia-drm.modeset=1
  • try a different driver version
  • upgrade to driver v545 and try the new fbdev parameter
  • reinstall the OS in efi mode
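
Before working through the list, it may be worth checking the current state over SSH. The sketch below assumes the nvidia_drm module exposes its modeset setting at /sys/module/nvidia_drm/parameters/modeset and that /sys/firmware/efi exists only on EFI boots. The function names `modeset_state` and `boot_mode` are hypothetical, and the optional path arguments are there purely for testing.

```shell
#!/bin/sh
# modeset_state [FILE]
# Print the nvidia-drm modeset value from sysfs (typically "Y" or "N"),
# or "unknown" if the file is not readable (e.g. nvidia_drm is not loaded).
modeset_state() {
    f="${1:-/sys/module/nvidia_drm/parameters/modeset}"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "unknown"
    fi
}

# boot_mode [DIR]
# Print "efi" if the kernel was booted via EFI firmware, otherwise "legacy".
boot_mode() {
    if [ -d "${1:-/sys/firmware/efi}" ]; then
        echo "efi"
    else
        echo "legacy"
    fi
}

# Example usage on the affected machine:
#   modeset_state
#   boot_mode
```

The inxi output earlier in the thread shows UEFI-[Legacy], so `boot_mode` would be expected to print legacy on this machine.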

For documentation purposes, this is what my grub settings are right now:

GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet nouveau.modeset=0"
GRUB_CMDLINE_LINUX="quiet rd.driver.blacklist=grub.nouveau rcutree.rcu_idle_gp_delay=1"

Changed to:

GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet nouveau.modeset=0"
GRUB_CMDLINE_LINUX="quiet rd.driver.blacklist=grub.nouveau rcutree.rcu_idle_gp_delay=1 nvidia-drm.modeset=1"

Regenerate /boot/grub/grub.cfg:
grub-mkconfig -o /boot/grub/grub.cfg
Reboot and execute:
sudo update-grub
Screen freezes at a different point now, but effectively the same outcome.
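
After regenerating grub.cfg and rebooting, one way to confirm that a parameter actually reached the kernel is to look at /proc/cmdline. This is a minimal sketch; `has_param` is a hypothetical helper, and the optional second argument exists only so the check can be tried on an arbitrary string.

```shell
#!/bin/sh
# has_param PARAM [CMDLINE]
# Succeeds if PARAM appears as a whole word on the kernel command line.
# CMDLINE defaults to the running kernel's /proc/cmdline.
has_param() {
    cmdline="${2:-$(cat /proc/cmdline)}"
    case " $cmdline " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage after a reboot:
#   has_param nvidia-drm.modeset=1 && echo "modeset=1 is active on this boot"
```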

Upgraded to driver v545:
Install the GPG key:

curl -fSsL https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/3bf863cc.pub | sudo gpg --dearmor | sudo tee /usr/share/keyrings/nvidia-drivers.gpg > /dev/null 2>&1

Add repo to sources:

echo 'deb [signed-by=/usr/share/keyrings/nvidia-drivers.gpg] https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/ /' | sudo tee /etc/apt/sources.list.d/nvidia-drivers.list

And then updated:
sudo apt update && sudo apt upgrade
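
It can be worth confirming that the upgrade really moved the driver to the 545 series before trying fbdev. The sketch below only compares version strings. The helper names are hypothetical, and the 545 threshold is taken from the advice in this thread, not from NVIDIA documentation.

```shell
#!/bin/sh
# driver_major VERSION
# Print the major component of a driver version such as "545.23.06".
driver_major() {
    echo "${1%%.*}"
}

# supports_fbdev VERSION
# Succeeds if the major version is at least 545 (threshold assumed from this thread).
supports_fbdev() {
    [ "$(driver_major "$1")" -ge 545 ]
}

# Example usage on the affected machine:
#   v=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader)
#   supports_fbdev "$v" && echo "driver $v should accept nvidia-drm.fbdev"
```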
And this is where I’m getting a little stuck. From what I’ve read, the fbdev parameter needs to be set to 1, but I can’t find instructions on where it’s supposed to go. Does it need to be added to the end like so?

GRUB_CMDLINE_LINUX="quiet rd.driver.blacklist=grub.nouveau rcutree.rcu_idle_gp_delay=1 nvidia-drm.modeset=1 nvidia-drm.fbdev=1"

Thanks again.

That should be correct, though there don’t seem to be any docs about it yet.
Please create a new nvidia-bug-report.log with it set.

New grub file configuration:

GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet nouveau.modeset=0"
GRUB_CMDLINE_LINUX="quiet rd.driver.blacklist=grub.nouveau rcutree.rcu_idle_gp_delay=1 nvidia-drm.modeset=1 nvidia-drm.modeset=1"

New bug report file:
nvidia-bug-report.log.gz (322.1 KB)

Now you just set modeset=1 twice.
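
A slip like this is easy to catch mechanically. The following sketch lists any word that occurs more than once on a kernel command line. `duplicate_params` is a hypothetical helper, and by default it reads /proc/cmdline from the running system.

```shell
#!/bin/sh
# duplicate_params [CMDLINE]
# Print each word that occurs more than once on the command line, one per line.
# CMDLINE defaults to the running kernel's /proc/cmdline.
duplicate_params() {
    cmdline="${1:-$(cat /proc/cmdline)}"
    # Unquoted on purpose: split the command line into one word per line.
    printf '%s\n' $cmdline | sort | uniq -d
}

# Example usage after a reboot:
#   duplicate_params
```

A repeated parameter is not necessarily harmful by itself, but here it points to a missing one: the second modeset=1 took the place of the intended nvidia-drm.fbdev=1.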

Ooops, copypasta’d the wrong stuff.

IT WORKS! Thank you very much for your help. I suppose it was just a driver issue?

Yes, rare but annoying.