Failed to install driver for NVIDIA A2 on Debian 12

  • Brief story: I have a remote Debian 12 server with an NVIDIA A2 and can’t install the driver. I tried the Debian way (via apt, following the Debian article) and also tried downloading the driver .run file.
    All my attempts failed in the end.
    I need ideas on how to install the driver.
    Or, if my device may be broken, how can I make sure?

  • Discover GPU device:

  • lspci -nn | grep VGA

    • 00:01.0 VGA compatible controller [0300]: Device [1234:1111] (rev 02)
    • 06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)
    • 10de:25b6 (vendor and device ID)
    • “NVIDIA A2 Tensor Core GPU”
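The vendor and device ID can also be pulled out of the lspci output with a small filter. This is only a sketch: `nvidia_pci_ids` is a hypothetical helper name, and the sample input is the two lines quoted above (10de is NVIDIA's PCI vendor ID).

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -nn` output on stdin and print the
# vendor:device IDs of NVIDIA devices (vendor 10de), one per line.
nvidia_pci_ids() {
    grep -oE '\[10de:[0-9a-f]{4}\]' | sed -e 's/^\[//' -e 's/]$//'
}

# Sample input copied from the post; on the real machine use:
#   lspci -nn | nvidia_pci_ids
sample='00:01.0 VGA compatible controller [0300]: Device [1234:1111] (rev 02)
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA107GL [A2 / A16] [10de:25b6] (rev a1)'

printf '%s\n' "$sample" | nvidia_pci_ids
```

The printed ID is the one to look up in the driver's supported-devices list.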
  • First I tried following the article from Debian:

    • Only one precompiled driver version is available for Debian 12: version 525.105.17
      • The device with ID 25b6 is listed in [supported devices]
  • Remove previous drivers:

    • sudo apt-get remove --purge '^nvidia-.*'

    • sudo apt-get remove --purge '^libnvidia-.*'

    • sudo apt-get remove --purge '^cuda-.*'

    • apt autoremove

  • reboot

  • Set up the drivers following the article:

  • apt update

  • apt upgrade

  • apt install nvidia-driver firmware-misc-nonfree

  • I see that the version actually installed is 535.183.01-1, which is newer than the one mentioned in the article

  • mokutil --sb-state

    • SecureBoot disabled
    • So, no need to sign the resulting modules
  • reboot

  • nvidia-smi

    • NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
  • nvidia-bug-report.sh

    • it shows me that the NVIDIA services are dead
  • Try to enable and manually start the NVIDIA services:

    • sudo systemctl start nvidia-suspend.service

    • sudo systemctl start nvidia-hibernate.service

    • sudo systemctl start nvidia-resume.service

  • Check whether they are alive:

    • systemctl status nvidia-suspend.service nvidia-hibernate.service nvidia-resume.service

      • no, still dead
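As far as I know, the nvidia-suspend, nvidia-hibernate and nvidia-resume units are one-shot hooks that only run around suspend and resume, so "inactive (dead)" may be their normal state rather than a sign of failure. A more direct check is whether the kernel module itself is loaded. A minimal sketch, where `is_module_loaded` is a hypothetical helper name and not part of any NVIDIA tooling:

```shell
#!/bin/sh
# Hypothetical helper: read /proc/modules-style text on stdin and succeed
# if a module with exactly the given name is listed in the first column.
is_module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit found ? 0 : 1 }'
}

# Report the state of the two drivers that compete for the GPU.
if [ -r /proc/modules ]; then
    for mod in nvidia nouveau; do
        if is_module_loaded "$mod" < /proc/modules; then
            echo "$mod: loaded"
        else
            echo "$mod: not loaded"
        fi
    done
fi
```

If nvidia is not listed, nvidia-smi has nothing to talk to, whatever the state of the service units.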
  • Let’s try downloading the driver .run file of the version recommended by Debian

  • Again, remove previous drivers:

    • sudo apt-get remove --purge '^nvidia-.*'

    • sudo apt-get remove --purge '^libnvidia-.*'

    • sudo apt-get remove --purge '^cuda-.*'

    • apt autoremove

  • reboot

  • Download and manually run the driver [version 525.105.17], which was mentioned as the only one suitable for Debian 12

  • sh NVIDIA-Linux-x86_64-525.105.17.run

    • Failed with message:
      Unable to load the kernel module ‘nvidia.ko’. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.
      Please see the log entries ‘Kernel module load error’ and ‘Kernel messages’ at the end of the file ‘/var/log/nvidia-installer.log’ for more information.
    • 4 Reasons to explore:
        1. kernel module was built against the wrong or improperly configured kernel sources
        • let’s check kernel version
        • uname -r

          • 6.1.0-22-amd64
        • dpkg --list | grep linux-image
          • installed: 6.1.0-18, 6.1.0-21, 6.1.0-22
        • let’s remove all except the current 6.1.0-22
          • dpkg --purge linux-image-6.1.0-18-amd64
          • dpkg --purge linux-image-6.1.0-21-amd64
        • let’s remove the old kernel sources (headers)
          • apt remove linux-headers-6.1.0-21-amd64
          • apt remove linux-headers-6.1.0-21-common
        • so we already have the sources for the running kernel version; removing the old ones did not change the compilation errors.
          Looks like we are good there.
        • just in case, I reinstalled linux-headers, but it had no effect
        • sudo apt install --reinstall linux-headers-$(uname -r)

        2. version of gcc that differs from the one used to build the target kernel
        • cat /proc/version

          • Linux version 6.1.0-22-amd64 (debian-kernel@lists.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.94-1 (2024-06-21)
        • gcc --version
          • gcc (Debian 12.2.0-14) 12.2.0
        • They match
        3. Another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s)
        • lsmod | grep -E 'nvidia|nouveau'
        • shows nothing.
        4. no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release
        • We have the device:
          lspci -nn | grep VGA
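The gcc comparison from reason 2 can be scripted instead of being compared by eye. This is a sketch only: `kernel_gcc_version` and `installed_gcc_version` are hypothetical helper names, and the sed pattern assumes the Debian-style `/proc/version` format shown above.

```shell
#!/bin/sh
# Sketch: compare the gcc version that built the kernel with the installed gcc.

# Reads a /proc/version line on stdin and prints the compiler version.
# Assumes the "(gcc-12 (Debian 12.2.0-14) 12.2.0, ..." layout quoted in the post.
kernel_gcc_version() {
    sed -n 's/.*(gcc[^ ]* ([^)]*) \([0-9][0-9.]*\).*/\1/p'
}

# Reads `gcc --version` output on stdin and prints the last field of line 1.
installed_gcc_version() {
    awk 'NR == 1 { print $NF }'
}

# Sample strings copied from the post; on the real machine feed in
# /proc/version and the output of `gcc --version` instead.
k=$(printf '%s\n' 'Linux version 6.1.0-22-amd64 (debian-kernel@lists.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.94-1 (2024-06-21)' | kernel_gcc_version)
g=$(printf '%s\n' 'gcc (Debian 12.2.0-14) 12.2.0' | installed_gcc_version)

if [ "$k" = "$g" ]; then
    echo "match: $k"
else
    echo "mismatch: kernel=$k installed=$g"
fi
```

On the live system the two inputs would be `kernel_gcc_version < /proc/version` and `gcc --version | installed_gcc_version`.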
    • Relevant errors from ‘/var/log/nvidia-installer.log’:
      • Skipping BTF generation for /tmp/selfgz1212/NVIDIA-Linux-x86_64-525.105.17/kernel/nvidia.ko due to unavailability of vmlinux

      • → Kernel module load error: No such device

      • → Kernel messages:

      • [ 695.387572] nvidia 0000:06:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none

      • [ 695.387682] NVRM: The NVIDIA GPU 0000:06:00.0 (PCI ID: 10de:25b6)

      • NVRM: installed in this system is not supported by the

      • NVRM: NVIDIA 525.105.17 driver release.

      • NVRM: Please see ‘Appendix A - Supported NVIDIA GPU Products’

      • NVRM: in this release’s README, available on the operating system

      • NVRM: specific graphics driver download page at www.nvidia.com.

      • [ 695.388191] nvidia: probe of 0000:06:00.0 failed with error -1

      • [ 695.388216] NVRM: The NVIDIA probe routine failed for 1 device(s).

      • [ 695.388217] NVRM: None of the NVIDIA devices were initialized.

      • [ 695.388851] nvidia-nvlink: Unregistered Nvlink Core, major device number 241

    • Why is vmlinux unavailable to the compiler?

I tried installing drivers 525.105.17 and 550.90.07 via .run files, but nvidia-installer.log always shows the same error near the end of the log (only the driver version differs):

NVRM: The NVIDIA GPU 0000:06:00.0 (PCI ID: 10de:25b6)
NVRM: installed in this system is not supported by the
NVRM: NVIDIA 550.90.07 driver release.

But earlier in the log I see messages like:
Skipping BTF generation for /tmp/selfgz1205/NVIDIA-Linux-x86_64-525.105.17/kernel/nvidia.ko due to unavailability of vmlinux

These messages are also the same with both the 525.105.17 and 550.90.07 drivers.

Please advise: which of these errors in the log occurs first?
Do I have a problem with vmlinux, or can’t I find a driver that supports my card?
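
To see which message comes first, the key lines can be listed with their line numbers. A minimal sketch, where `key_log_lines` is a hypothetical helper name and the patterns are taken from the excerpts quoted above. My understanding, which I have not verified on this machine, is that the BTF message is emitted while the module is being built and the NVRM lines only appear later, when the module is loaded.

```shell
#!/bin/sh
# Hypothetical helper: print the key messages of an nvidia-installer log,
# each prefixed with its line number, in the order they were written.
key_log_lines() {
    grep -nE 'Skipping BTF generation|Kernel module load error|NVRM:' "$1"
}

log=/var/log/nvidia-installer.log
if [ -r "$log" ]; then
    key_log_lines "$log"
else
    echo "cannot read $log" >&2
fi
```

Lower line numbers mean the message was logged earlier in the install run.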