Headless K80 system: NVIDIA device nodes (/dev/nvidia*) not created at install time

I have a brand new PC with a used K80, running Ubuntu 20.04, that is failing to install the NVIDIA drivers during the CUDA installation process. The process appears to complete through step 3.8 of the instructions (sudo apt install cuda), which cranks and grinds for quite some time unpacking and installing libraries. When it’s done, though, nvidia-smi fails with this error message:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

This is a very common error message! I looked through the forums and found lots of great hints, but none of them have worked. Here are a few things I have ruled out:

  • The hardware is detected - I’ve verified my kernel version (5.10.0) and gcc version (9.3.0). lspci sees my K80 as a pair of GPUs with addresses on the PCIe bus, as well as the integrated graphics chip my CPU uses.
  • Secure Boot is disabled - I have used sudo mokutil --sb-state to confirm that Secure Boot is disabled. (It also says that “platform is in setup state,” which suggests that I could enable it and enroll a key, but as long as I don’t, it will stay disabled.)
  • GRUB is booting the right kernel. This got me once, and I had to back all the way out and start over. I am running 5.10.0 lowlatency, and I wrote a shell script to automate all the version checks before I start messing with .deb files (a sketch of it follows this list).
  • Blacklists are cleaned out - I made sure to remove any blacklist entries that would prevent the nvidia modules from loading
  • Nouveau and Wayland are not the issue - both have been removed or disabled
  • Xorg is failing (but that’s okay?) - My system is headless and I’m using NoMachine to remote in on port 4000. I don’t know the exact mechanism NoMachine uses, but it appears to provide its own X server. In any case, it’s working fine.
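
Roughly, the check script looks like this - a minimal sketch, and the expected values in the comments are just my own setup:

#!/bin/bash
# Pre-flight checks before touching any CUDA/NVIDIA .deb packages.
echo "Kernel:  $(uname -r)"                    # expecting the 5.10.0 lowlatency kernel
echo "GCC:     $(gcc --version | head -n1)"    # expecting 9.3.0
echo "Secure Boot state:"
sudo mokutil --sb-state                        # expecting "SecureBoot disabled"
echo "NVIDIA devices on the PCIe bus:"
lspci | grep -i nvidia                         # the K80 should show up as two devices
echo "Nouveau loaded?"
lsmod | grep -i nouveau || echo "nouveau not loaded"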

Here are some things I think are probably not okay:

  • nvidia-persistenced.service is failing, and systemctl status nvidia-persistenced says the device nodes might not even have been created. (There are no /dev/nvidia* nodes at all - that’s bad, right? The commands I used to check are right after the log excerpt below.)
  • /var/log/syslog is full of repeated messages about modprobe failing to load the nvidia modules. That error is about as close as I can come to the root cause, I think:
Mar 13 18:04:52 obsidian systemd-udevd[330]: nvidia: Process '/sbin/modprobe nvidia-uvm' failed with exit code 1.
Mar 13 18:04:52 obsidian systemd[1]: nvidia-persistenced.service: Start request repeated too quickly.
Mar 13 18:04:52 obsidian systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Mar 13 18:04:52 obsidian systemd[1]: Failed to start NVIDIA Persistence Daemon.
Mar 13 18:04:52 obsidian kernel: [ 7304.084235] nvidia-nvlink: Nvlink Core is being initialized, major device number 511
Mar 13 18:04:52 obsidian kernel: [ 7304.084242] NVRM: request_mem_region failed for 0M @ 0x0. This can
Mar 13 18:04:52 obsidian kernel: [ 7304.084242] NVRM: occur when a driver such as rivatv is loaded and claims
Mar 13 18:04:52 obsidian kernel: [ 7304.084242] NVRM: ownership of the device's registers.
Mar 13 18:04:52 obsidian kernel: [ 7304.084632] nvidia: probe of 0000:03:00.0 failed with error -1
Mar 13 18:04:52 obsidian kernel: [ 7304.084643] NVRM: request_mem_region failed for 0M @ 0x0. This can
Mar 13 18:04:52 obsidian kernel: [ 7304.084643] NVRM: occur when a driver such as rivatv is loaded and claims
Mar 13 18:04:52 obsidian kernel: [ 7304.084643] NVRM: ownership of the device's registers.
Mar 13 18:04:52 obsidian kernel: [ 7304.084645] nvidia: probe of 0000:04:00.0 failed with error -1
Mar 13 18:04:52 obsidian kernel: [ 7304.084665] NVRM: The NVIDIA probe routine failed for 2 device(s).
Mar 13 18:04:52 obsidian kernel: [ 7304.084665] NVRM: None of the NVIDIA devices were initialized.
Mar 13 18:04:52 obsidian kernel: [ 7304.084812] nvidia-nvlink: Unregistered the Nvlink Core, major device number 511
Mar 13 18:04:52 obsidian systemd-udevd[330]: nvidia: Process '/sbin/modprobe nvidia-modeset' failed with exit code 1.
Mar 13 18:04:52 obsidian kernel: [ 7304.254017] nvidia-nvlink: Nvlink Core is being initialized, major device number 511
Mar 13 18:04:52 obsidian kernel: [ 7304.254022] NVRM: request_mem_region failed for 0M @ 0x0. This can
Mar 13 18:04:52 obsidian kernel: [ 7304.254022] NVRM: occur when a driver such as rivatv is loaded and claims
Mar 13 18:04:52 obsidian kernel: [ 7304.254022] NVRM: ownership of the device's registers.
Mar 13 18:04:52 obsidian kernel: [ 7304.254448] nvidia: probe of 0000:03:00.0 failed with error -1
Mar 13 18:04:52 obsidian kernel: [ 7304.254459] NVRM: request_mem_region failed for 0M @ 0x0. This can
Mar 13 18:04:52 obsidian kernel: [ 7304.254459] NVRM: occur when a driver such as rivatv is loaded and claims
Mar 13 18:04:52 obsidian kernel: [ 7304.254459] NVRM: ownership of the device's registers.
Mar 13 18:04:52 obsidian kernel: [ 7304.254460] nvidia: probe of 0000:04:00.0 failed with error -1
Mar 13 18:04:52 obsidian kernel: [ 7304.254476] NVRM: The NVIDIA probe routine failed for 2 device(s).
Mar 13 18:04:52 obsidian kernel: [ 7304.254477] NVRM: None of the NVIDIA devices were initialized.
Mar 13 18:04:52 obsidian kernel: [ 7304.254593] nvidia-nvlink: Unregistered the Nvlink Core, major device number 511
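
For completeness, these are the checks behind the two bullets above (standard commands, nothing exotic):

ls -l /dev/nvidia*                        # returns nothing at all on my system
systemctl status nvidia-persistenced      # shows the repeated start failures
sudo dmesg | grep -i nvrm                 # same request_mem_region / probe failures as above
journalctl -b -u nvidia-persistenced      # full history for the persistence daemon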

I don’t see a driver (“such as rivatv”) seizing control, unless it’s the elephant in the room – the i915 display driver that runs on my CPU and powers the HDMI connection I’m not using. What steps should I take next? I am wary of putting my system in a state where I can’t even go downstairs and connect a monitor to it to troubleshoot, so I feel like I’m at a dead end.

nvidia-bug-report.log.gz (35.1 KB)

Please check your BIOS for an option called “Above 4G decoding” or “large/64-bit BARs” and enable it; possibly also disable CSM and reinstall in EFI mode if not already done.
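
To see where you currently stand, a couple of read-only checks (10de is just the NVIDIA PCI vendor ID):

[ -d /sys/firmware/efi ] && echo "booted in EFI mode" || echo "booted in legacy/CSM mode"
sudo lspci -vv -d 10de: | grep -i 'memory at'   # with Above 4G decoding enabled, the large
                                                # prefetchable BARs should be 64-bit and mapped above 4G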


Thanks - I hadn’t considered that a motherboard/BIOS setting could be the cause, but this sounds promising. When you suggest “reinstall”, do you mean just the .deb packages, or the entire OS?

The entire OS, but only if enabling the BIOS option does not work or prevents booting. Due to the error messages flooding the logs I couldn’t see any details.


Yeah, you weren’t kidding – changing those settings in my BIOS made my partition unbootable. I’m reinstalling Ubuntu 20.04 today, although the files on the HDD should be intact… all I should really have to do is stand up a new working initramfs and GRUB entry pointing at the existing filesystem. In other words, changing those settings changed the interface between my BIOS and GRUB, so if I can fix that, the rest of the filesystem (and all my settings?) should still be accessible. We’ll see!

I don’t think that’ll work. EFI needs a different partition table (GPT), while you likely have an old MBR partition table.
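
You can confirm which partition table you currently have without changing anything:

lsblk -o NAME,PTTYPE,FSTYPE,MOUNTPOINT      # PTTYPE "gpt" vs "dos" (= MBR)
sudo parted -l | grep -i 'partition table'  # prints "msdos" or "gpt" per disk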

Okay, well, I’m not going to be able to reinstall my whole OS this weekend, but this is a good lead. It sure would have been nice to know this before I spent last weekend building the system and configuring the OS… I wish NVIDIA’s CUDA instructions included some OS- and BIOS-level “common stumbling blocks” so people didn’t waste their time, especially when the solution to so many of these things is to reinstall the whole OS.

It’s actually there but you can only find it if you know what you’re looking for:
https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#bar-sizes

Alright. Despite turning on Above 4G decoding and large BARs in my BIOS menu, and a full reinstall of Ubuntu onto a GPT partition table, I am still running into essentially the same problem. I’ve disabled Nouveau and Wayland and removed the NVMe drive that was causing my motherboard’s boot problems.

As a bonus, my system now will not show video at resolutions greater than 1024x768, and while modprobe i915 seems to succeed, all three of my VGA devices (the onboard graphics and the two K80 GPUs) say “Display: UNCLAIMED”. It seems fine for the K80 GPUs not to claim a display, but I remember that when things were working, my integrated video was driving a display at high resolution without any trouble.
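
For reference, this is how I’m checking which driver (if any) has claimed each device, in case I’m misreading the output:

sudo lshw -C display | grep -Ei 'display|product|configuration'   # UNCLAIMED = no driver bound
lspci -k | grep -EA3 'VGA|3D'   # "Kernel driver in use:" should list i915 for the iGPU
                                # and nvidia for the two K80 GPUs once this is sorted out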

I would try to dig in and fix that myself but getting the K80s to play nice with CUDA is my priority. As you can see below, the error is basically the same as it was before.

nvidia-bug-report.log.gz (514.1 KB)

64-bit resources are properly enabled now but the BIOS doesn’t play nice. Please add the kernel parameter
pci=realloc
then reboot and provide a dmesg output.
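
If you haven’t added kernel parameters before, roughly this on a standard Ubuntu/GRUB setup (back up the file first):

sudo nano /etc/default/grub
# append pci=realloc to GRUB_CMDLINE_LINUX_DEFAULT, e.g.:
# GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=realloc"
sudo update-grub
sudo reboot
# after the reboot:
sudo dmesg > dmesg-realloc.txt    # attach this file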

Actually, I don’t think that will work with that mainboard/BIOS. The upstream PCI bridge 01.1 only gets a memory window of 16GB, but the K80 is a dual-GPU design which wants two 16GB memory windows.
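
You can inspect the windows the BIOS handed out yourself - assuming the bridge sits at 00:01.1 as in the bug report; adjust the address if it differs:

sudo lspci -vv -s 00:01.1 | grep -i 'memory behind'   # shows both the 32-bit and prefetchable windows
sudo cat /proc/iomem | grep -i 'pci bus'              # address ranges assigned to each PCI bus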

I’m pretty frustrated with this motherboard anyway, so swapping it out for a different model wouldn’t be the end of the world. Are all B560 motherboards likely to have this issue, or would another vendor’s implementation potentially behave better?

If it’s a limitation across all B560s, where would I look up the specifications so I can find a board that’s known to play nicely with Ubuntu and allows the allocation of enough sufficiently large memory windows?

It’s a plain BIOS issue, a result of consumer vs. workstation/server hardware. On consumer boards, the PCI resources assigned by the BIOS are often only checked for driving a standard graphics card, so it’s really trial and error. Mostly only workstation/server board BIOSes are designed with this kind of card in mind.
Did you try booting with pci=realloc nevertheless? Since the GPUs are sitting on the CPU bridge, the window might still get reassigned.