Issues loading driver on VMware virtualised Ubuntu 18.04

Hi,

We have a system with 2 x GV100GL (Tesla V100 PCIe 16GB) running VMware ESXi 6.7. In the hypervisor, the GPU is configured for “PCI Passthrough”, and one of the cards is assigned to a VM running Ubuntu 18.04 LTS. Inside the VM, the card is recognized:

# lspci | grep NVIDIA
13:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)

I downloaded the driver NVIDIA-Linux-x86_64-418.43.run and installed it like this:

# ./NVIDIA-Linux-x86_64-418.43.run --no-opengl-files --dkms -s

At the end of that process, I see the following error:

ERROR: Unable to load the 'nvidia-drm' kernel module.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

That log file contains nothing more about the issue than these two lines.

dmesg seems to have additional information, but I neither understand what it means nor can I find anything about it on the net:

[  291.353568] nvidia-nvlink: Nvlink Core is being initialized, major device number 243
[  291.354057] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR1 is 0M @ 0x0 (PCI:0000:13:00.0)
[  291.354058] NVRM: The system BIOS may have misconfigured your GPU.
[  291.354062] nvidia: probe of 0000:13:00.0 failed with error -1
[  291.354076] NVRM: The NVIDIA probe routine failed for 1 device(s).
[  291.354076] NVRM: None of the NVIDIA graphics adapters were initialized!
[  291.354210] nvidia-nvlink: Unregistered the Nvlink Core, major device number 243

I could not find anything on the net matching this virtualization setup and issue.
Please assist.

BR,
Marc
nvidia-bug-report.log.gz (46.8 KB)
nvidia-installer.log (2.2 KB)

I don’t think this is virtualization-specific; the same problem has been reported several times recently for bare-metal installs. At some point, the kernel introduced a bug regarding resource allocation:

[    0.274904] pci 0000:13:00.0: BAR 1: no space for [mem size 0x400000000 64bit pref]
[    0.274956] pci 0000:13:00.0: BAR 1: failed to assign [mem size 0x400000000 64bit pref]

In your case, it’s trying to map a 16GB region for BAR 1 (size 0x400000000), which fails.
The cause and the exact circumstances are unknown, at least to me. All you can do is try upgrading or downgrading the kernel.
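
For reference, the size in the "no space" message is hexadecimal, and 0x400000000 bytes works out to 16 GiB. A minimal sketch for pulling these lines out of dmesg and converting the size is below; the function name bar_sizes is made up for illustration, and the pattern only assumes the message format shown above:

```shell
#!/bin/sh
# Read dmesg-style lines on stdin and, for each "BAR n: no space for" message,
# print the PCI device, the BAR number and the requested size in GiB.
# bar_sizes is a hypothetical helper name, not part of any NVIDIA or kernel tool.
bar_sizes() {
    sed -n 's/.*pci \([0-9a-f:.]*\): BAR \([0-9]*\): no space for \[mem size \(0x[0-9a-f]*\).*/\1 \2 \3/p' |
    while read -r dev bar size; do
        # Shell arithmetic accepts the 0x prefix; 1073741824 bytes = 1 GiB.
        echo "$dev BAR $bar: $(( $size / 1073741824 )) GiB"
    done
}

# Typical use on the affected system:
#   dmesg | bar_sizes
```

Fed the first of the two lines quoted above, this should print "0000:13:00.0 BAR 1: 16 GiB".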

NB: On an Ubuntu system, you shouldn’t use the .run installer. Instead, add the Ubuntu graphics PPA and install the driver from there.
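
A rough sketch of the PPA route follows. The package name nvidia-driver-418 is an assumption on my part (chosen to match the 418.43 .run file); check what the PPA actually offers for your release before installing. The run wrapper and the DRY_RUN variable are only there for illustration, so that the script prints the commands by default instead of executing them:

```shell
#!/bin/sh
# Sketch: install the NVIDIA driver from the graphics-drivers PPA instead of the .run file.
# DRY_RUN defaults to 1, so commands are only printed; set DRY_RUN=0 to really run them.
# DRIVER_PKG is an assumed package name; verify it with "apt search nvidia-driver".
DRIVER_PKG="${DRIVER_PKG:-nvidia-driver-418}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

install_from_ppa() {
    run sudo add-apt-repository -y ppa:graphics-drivers/ppa
    run sudo apt-get update
    run sudo apt-get install -y "$DRIVER_PKG"
}

install_from_ppa
```

Note that this does not address the BAR 1 allocation failure itself; it only changes how the driver gets installed.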