Hello all,
I have a node with two A100s that I can access only remotely, via SSH. The node boots an Ubuntu 20.04 image via TFTP. The NVIDIA 460 driver was installed as part of CUDA, following these instructions. I blacklisted nouveau, rebuilt the initrd image, and made sure the TFTP boot uses it. But the nvidia kernel module does not get built, even though I think I have everything I need.
root@node21:/# dpkg -l nvidia*
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-================================-==================-============-=====================================================
un nvidia-304 <none> <none> (no description available)
un nvidia-340 <none> <none> (no description available)
un nvidia-384 <none> <none> (no description available)
un nvidia-390 <none> <none> (no description available)
un nvidia-common <none> <none> (no description available)
ii nvidia-compute-utils-460 460.27.04-0ubuntu1 amd64 NVIDIA compute utilities
ii nvidia-dkms-460 460.27.04-0ubuntu1 amd64 NVIDIA DKMS package
un nvidia-dkms-kernel <none> <none> (no description available)
ii nvidia-driver-460 460.27.04-0ubuntu1 amd64 NVIDIA driver metapackage
un nvidia-driver-binary <none> <none> (no description available)
un nvidia-kernel-common <none> <none> (no description available)
ii nvidia-kernel-common-460 460.27.04-0ubuntu1 amd64 Shared files used with the kernel module
un nvidia-kernel-source <none> <none> (no description available)
ii nvidia-kernel-source-460 460.27.04-0ubuntu1 amd64 NVIDIA kernel source package
un nvidia-legacy-304xx-vdpau-driver <none> <none> (no description available)
un nvidia-legacy-340xx-vdpau-driver <none> <none> (no description available)
un nvidia-libopencl1-dev <none> <none> (no description available)
ii nvidia-modprobe 460.27.04-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
un nvidia-opencl-icd <none> <none> (no description available)
un nvidia-persistenced <none> <none> (no description available)
ii nvidia-prime 0.8.14 all Tools to enable NVIDIA's Prime
ii nvidia-settings 460.27.04-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
un nvidia-settings-binary <none> <none> (no description available)
un nvidia-smi <none> <none> (no description available)
un nvidia-utils <none> <none> (no description available)
ii nvidia-utils-460 460.27.04-0ubuntu1 amd64 NVIDIA driver support binaries
un nvidia-vdpau-driver <none> <none> (no description available)
All build components are also installed:
root@node21:/# dpkg -l make build-essential linux-headers-5.4.0*
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-==============================-=============-============-========================================================
ii build-essential 12.8ubuntu1.1 amd64 Informational list of build-essential packages
ii linux-headers-5.4.0-58 5.4.0-58.64 all Header files related to Linux kernel version 5.4.0
ii linux-headers-5.4.0-58-generic 5.4.0-58.64 amd64 Linux kernel headers for version 5.4.0 on 64 bit x86 SMP
ii make 4.2.1-1.2 amd64 utility for directing compilation
Yet when I run modprobe nvidia, it reports that the module cannot be found; the /dev/nvidia* device files are also missing, and nvidia-modprobe does nothing. If I try to reinstall nvidia-dkms-460, I get the following:
root@node21:/home/users/andrej# apt reinstall nvidia-dkms-460
Reading package lists... Done
Building dependency tree
Reading state information... Done
0 upgraded, 0 newly installed, 1 reinstalled, 0 to remove and 0 not upgraded.
Need to get 29.5 kB of archives.
After this operation, 0 B of additional disk space will be used.
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 nvidia-dkms-460 460.27.04-0ubuntu1 [29.5 kB]
Fetched 29.5 kB in 0s (221 kB/s)
(Reading database ... 96625 files and directories currently installed.)
Preparing to unpack .../nvidia-dkms-460_460.27.04-0ubuntu1_amd64.deb ...
Removing all DKMS Modules
Done.
Unpacking nvidia-dkms-460 (460.27.04-0ubuntu1) over (460.27.04-0ubuntu1) ...
Setting up nvidia-dkms-460 (460.27.04-0ubuntu1) ...
update-initramfs: deferring update (trigger activated)
A modprobe blacklist file has been created at /etc/modprobe.d to prevent Nouveau
from loading. This can be reverted by deleting the following file:
/etc/modprobe.d/nvidia-graphics-drivers.conf
A new initrd image has also been created. To revert, please regenerate your
initrd by running the following command after deleting the modprobe.d file:
`/usr/sbin/initramfs -u`
*****************************************************************************
*** Reboot your computer and verify that the NVIDIA graphics driver can ***
*** be loaded. ***
*****************************************************************************
INFO:Enable nvidia
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/lenovo_thinkpad
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/put_your_quirks_here
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/dell_latitude
Loading new nvidia-460.27.04 DKMS files...
Building for 5.4.0-58-generic
Building for architecture x86_64
Module build for kernel 5.4.0-58-generic was skipped since the
kernel headers for this kernel does not seem to be installed.
Processing triggers for initramfs-tools (0.136ubuntu6.3) ...
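The "kernel headers ... does not seem to be installed" message is odd given the dpkg output above, so one thing that may be worth checking is whether the headers are actually reachable from the module tree of the running kernel. The following is only a minimal sketch under an assumption: that the DKMS packaging scripts decide whether headers are present by looking for /lib/modules/<kernel>/build/include. I have not confirmed that against the DKMS sources. The helper name headers_visible is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DKMS): succeeds if a kernel header tree is
# reachable at <modules_root>/<kernel>/build/include.
# ASSUMPTION: this is the location the DKMS scripts inspect.
headers_visible() {
    local modules_root="$1" kver="$2"
    [ -d "${modules_root}/${kver}/build/include" ]
}

# Check the kernel that is actually running on the node.
kver="$(uname -r)"
if headers_visible /lib/modules "$kver"; then
    echo "headers reachable for running kernel ${kver}"
else
    echo "headers NOT reachable for running kernel ${kver}"
    # Show what the build entry is, if it exists at all (it may be a dangling symlink).
    ls -ld "/lib/modules/${kver}/build" 2>/dev/null || echo "no build entry under /lib/modules/${kver}"
fi
```

If this reports the headers as not reachable while dpkg still lists linux-headers-5.4.0-58-generic as installed, the mismatch might come from how the netboot image is assembled (for example, a missing or dangling build symlink), but that is a guess rather than a diagnosis.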
Long story short, I tried everything I could think of and I still can’t get the module to build. I’m attaching the debug log for further information in case it’s helpful.
Thanks in advance for any and all insight and suggestions on what to try next!
nvidia-bug-report.log.gz (72.4 KB)