I have a brand-new PC with a used K80, running Ubuntu 20.04, on which the NVIDIA driver fails to install during the CUDA installation process. The install appears to complete reliably through step 3.8 of the instructions (`sudo apt install cuda`), which cranks and grinds for quite some time unpacking and installing libraries. When it’s done, though, `nvidia-smi` fails with this error message:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
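A couple of quick checks after each reboot confirm that the kernel module never actually initializes (a minimal sketch of what I run; nothing NVIDIA-related shows up):

```bash
# Is any nvidia module resident? (Empty for me.)
lsmod | grep -i nvidia
# What did the kernel log when the module tried to load?
dmesg | grep -iE 'NVRM|nvidia'
# This file only exists once the driver has initialized.
cat /proc/driver/nvidia/version
```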
This is a very common error message! I looked through the forums and found lots of great hints, but none of them have worked. Here are a few things I have ruled out:
- The hardware is detected - I’ve verified my kernel version (5.10.0) and gcc version (9.3.0), and `lspci` sees my K80 as a pair of GPUs with addresses on the PCIe bus, as well as the integrated graphics chip my CPU uses.
- Secure Boot is disabled - I used `sudo mokutil --sb-state` to confirm that Secure Boot is disabled. (It also says that “platform is in setup state,” which suggests that I could enable it and store a key, but as long as I don’t, it will stay disabled.)
- Grub is booting the right kernel - this got me once, and I had to back all the way out and start over. I am running 5.10.0 lowlatency, and I wrote a shell script to automate all the version checks before I start messing with .deb files (a sketch follows this list).
- Blacklists are cleaned out - I made sure to remove any blacklist entries that could keep the NVIDIA modules from loading.
- Nouveau and Wayland are not the issue - both have been removed or disabled
- Xorg is failing (but that’s okay?) - my system is headless and I’m using NoMachine to remote in on port 4000. I don’t know the exact mechanism NoMachine uses, but it looks like an X server, and in any case it’s working fine.
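For reference, this is the gist of the version-check script mentioned above (a rough sketch; the expected values are hard-coded for my machine):

```bash
#!/usr/bin/env bash
# Pre-flight checks before touching any .deb files.
# Expected on this box: kernel 5.10.0 lowlatency, gcc 9.3.0,
# Secure Boot disabled, nouveau gone, no stray nvidia blacklists.
set -u

echo "kernel:  $(uname -r)"
echo "gcc:     $(gcc -dumpversion)"
echo "secure boot state:"
sudo mokutil --sb-state
echo "nouveau module:"
lsmod | grep -i nouveau || echo "  not loaded (good)"
echo "blacklist entries mentioning nvidia:"
grep -rn nvidia /etc/modprobe.d/ || echo "  none"
```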
Here are some things I think are probably not okay:
- `nvidia-persistenced.service` is failing, and `systemctl status nvidia-persistenced` says the driver might not even be in `/dev/nvidia*`. (There are no `/dev/nvidia*` device nodes at all - that’s bad, right?)
- `/var/log/syslog` is being spammed with messages about modprobe failing to initialize the driver. These are about as close as I can get to the root cause, I think:
Mar 13 18:04:52 obsidian systemd-udevd[330]: nvidia: Process '/sbin/modprobe nvidia-uvm' failed with exit code 1.
Mar 13 18:04:52 obsidian systemd[1]: nvidia-persistenced.service: Start request repeated too quickly.
Mar 13 18:04:52 obsidian systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Mar 13 18:04:52 obsidian systemd[1]: Failed to start NVIDIA Persistence Daemon.
Mar 13 18:04:52 obsidian kernel: [ 7304.084235] nvidia-nvlink: Nvlink Core is being initialized, major device number 511
Mar 13 18:04:52 obsidian kernel: [ 7304.084242] NVRM: request_mem_region failed for 0M @ 0x0. This can
Mar 13 18:04:52 obsidian kernel: [ 7304.084242] NVRM: occur when a driver such as rivatv is loaded and claims
Mar 13 18:04:52 obsidian kernel: [ 7304.084242] NVRM: ownership of the device's registers.
Mar 13 18:04:52 obsidian kernel: [ 7304.084632] nvidia: probe of 0000:03:00.0 failed with error -1
Mar 13 18:04:52 obsidian kernel: [ 7304.084643] NVRM: request_mem_region failed for 0M @ 0x0. This can
Mar 13 18:04:52 obsidian kernel: [ 7304.084643] NVRM: occur when a driver such as rivatv is loaded and claims
Mar 13 18:04:52 obsidian kernel: [ 7304.084643] NVRM: ownership of the device's registers.
Mar 13 18:04:52 obsidian kernel: [ 7304.084645] nvidia: probe of 0000:04:00.0 failed with error -1
Mar 13 18:04:52 obsidian kernel: [ 7304.084665] NVRM: The NVIDIA probe routine failed for 2 device(s).
Mar 13 18:04:52 obsidian kernel: [ 7304.084665] NVRM: None of the NVIDIA devices were initialized.
Mar 13 18:04:52 obsidian kernel: [ 7304.084812] nvidia-nvlink: Unregistered the Nvlink Core, major device number 511
Mar 13 18:04:52 obsidian systemd-udevd[330]: nvidia: Process '/sbin/modprobe nvidia-modeset' failed with exit code 1.
Mar 13 18:04:52 obsidian kernel: [ 7304.254017] nvidia-nvlink: Nvlink Core is being initialized, major device number 511
Mar 13 18:04:52 obsidian kernel: [ 7304.254022] NVRM: request_mem_region failed for 0M @ 0x0. This can
Mar 13 18:04:52 obsidian kernel: [ 7304.254022] NVRM: occur when a driver such as rivatv is loaded and claims
Mar 13 18:04:52 obsidian kernel: [ 7304.254022] NVRM: ownership of the device's registers.
Mar 13 18:04:52 obsidian kernel: [ 7304.254448] nvidia: probe of 0000:03:00.0 failed with error -1
Mar 13 18:04:52 obsidian kernel: [ 7304.254459] NVRM: request_mem_region failed for 0M @ 0x0. This can
Mar 13 18:04:52 obsidian kernel: [ 7304.254459] NVRM: occur when a driver such as rivatv is loaded and claims
Mar 13 18:04:52 obsidian kernel: [ 7304.254459] NVRM: ownership of the device's registers.
Mar 13 18:04:52 obsidian kernel: [ 7304.254460] nvidia: probe of 0000:04:00.0 failed with error -1
Mar 13 18:04:52 obsidian kernel: [ 7304.254476] NVRM: The NVIDIA probe routine failed for 2 device(s).
Mar 13 18:04:52 obsidian kernel: [ 7304.254477] NVRM: None of the NVIDIA devices were initialized.
Mar 13 18:04:52 obsidian kernel: [ 7304.254593] nvidia-nvlink: Unregistered the Nvlink Core, major device number 511
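For what it’s worth, here is how I checked whether anything else has claimed the card (bus addresses 03:00.0 and 04:00.0 taken from the probe failures above):

```bash
# Shows "Kernel driver in use:" / "Kernel modules:" per GPU function, if any.
lspci -k -s 03:00.0
lspci -k -s 04:00.0
# BAR assignments; the "0M @ 0x0" in the log makes me wonder whether
# the card's memory regions were ever assigned at all.
sudo lspci -vv -s 03:00.0 | grep -i region
```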
Checking that way, I don’t see a driver (“such as rivatv”) seizing control, unless it’s the elephant in the room – the i915 driver for my CPU’s integrated graphics, which powers the HDMI output I’m not using. What steps should I take next? I am wary of putting the system in a state where I can’t even go downstairs and connect a monitor to troubleshoot, so I feel like I’m at a dead end.
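For completeness, the attached report comes from NVIDIA’s stock collection script, run right after one of these failed boots:

```bash
sudo nvidia-bug-report.sh   # writes nvidia-bug-report.log.gz to the current directory
```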
nvidia-bug-report.log.gz (35.1 KB)