CentOS 7.7: Tesla V100 graphics card driver installation failed

I have a CentOS 7.7 VM in which I want to install NVIDIA driver 440. The VM runs on an ESXi host that already has the NVIDIA driver installed. When I try to install the NVIDIA driver in the VM, I get this:
Error: Unable to load the ‘nvidia-drm’ kernel module.

You’ll have to hide the hypervisor.

Hello! What do you mean? What do I have to do to hide the hypervisor?

Forget about that, Teslas should work without it.
Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post. You will have to rename the file ending to something else since the forum software doesn’t accept .gz files (nifty!).

nvidia-bug-report.log.log (54.3 KB)

You’re running into this error:
This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:02:00.0)
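
To spot this condition in a long bug report, a minimal Python sketch follows. It assumes the BAR lines have exactly the format quoted above; the function name and the rule "size 0 or address 0 means unassigned" are illustrative choices, not part of any NVIDIA tool.

```python
import re

# Matches lines in the format quoted above, e.g.
# "NVRM: BAR1 is 0M @ 0x0 (PCI:0000:02:00.0)".
BAR_LINE = re.compile(
    r"NVRM: BAR(?P<index>\d+) is (?P<size_mb>\d+)M @ (?P<addr>0x[0-9a-fA-F]+) "
    r"\(PCI:(?P<pci>[0-9a-fA-F:.]+)\)"
)


def find_unassigned_bars(log_text):
    """Return (bar_index, pci_address) for each BAR reported with size 0 or address 0."""
    bad = []
    for match in BAR_LINE.finditer(log_text):
        size_mb = int(match.group("size_mb"))
        addr = int(match.group("addr"), 16)
        # Assumption: a zero size or zero base address means the BAR was not mapped.
        if size_mb == 0 or addr == 0:
            bad.append((int(match.group("index")), match.group("pci")))
    return bad


sample = "NVRM: BAR1 is 0M @ 0x0 (PCI:0000:02:00.0)"
print(find_unassigned_bars(sample))  # -> [(1, '0000:02:00.0')]
```

In practice the text would come from the dmesg section of the bug report rather than a hard-coded sample.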

Please check your BIOS for an option like “above 4G decoding” or “Large/64bit BARs” and enable it.
What kind of host mainboard/server are you using?

Dell PowerEdge R740

Might be tricky:
https://www.dell.com/community/PowerEdge-Hardware-General/Enabling-Memory-Mapped-IO-gt-4GB-has-issues-on-R720/td-p/4468413

I checked, and above 4G decoding is enabled. The problem isn’t on my host: the host detects the NVIDIA card, I’ve installed the NVIDIA driver on it, and everything is OK there. The problem shows up when I create a VM on the host. The VM detects the NVIDIA PCI device, but the driver installation in the VM fails.

Then please check whether you enabled the correct options for the VM:
https://kb.vmware.com/s/article/2139299
Edit: updated article on that:
https://kb.vmware.com/s/article/2142307

This is not working. I don’t want to use passthrough. Thanks!

This also applies to vGPU setups, please see:
https://docs.nvidia.com/grid/latest/grid-vgpu-release-notes-vmware-vsphere/index.html
-> “Requirements for Using C-Series Virtual Compute Server vGPUs”

Hello! I changed the settings as you said, but it is still not working. If I choose Quadro vDWS or GRID Virtual Application, everything is fine. The problem only appears when I want to use Virtual Compute Server.

Which ESXi version are you running?

6.7

Server

• GRID card model(s) = Tesla V100-PCIE-16GB
• Server Brand / Model / Memory per server = Dell PowerEdge R740
• Number of GRID cards and models installed per server = 1 GRID card → Tesla V100-PCIE-16GB
• Virtualization platform / Hypervisor = vSphere 6.7.0, 15160138 Hypervisor
• Patches applied over host hypervisor (if any) = No patches;
• vSGA, vGPU, VDA, DDA, HDX 3D Pro, RemoteFX, Bare Metal or Pass-through = vGPU
• vGPU Manager driver version (vib/rpm) = NVIDIA-VMware_ESXi_6.7_Host_Driver-430.83-1OEM.670.0.0.8169922x86_64.vib installed
• vGPU profile used for each GPU = 1
• Type of Profile used / Number of VMs using each vGPU profile = 1 VM
• DRS Enabled (if part of ESXi cluster) = No

VM

• Display driver version = NVIDIA-Linux-x86_64-430.83-grid.run
• OS / Version = CentOS 7.7
• System Memory = 32 GB
• Number of vCPUs = 16
• Number of displays / Display resolution = -
• Remoting Solution / Method of connecting to VM = ssh
• Version or Release of Remoting Solution = -
• Name of VM having issue (if applicable) = vGPU-AI-ML-01

License Server

• License Manager Software version = 2019.11.0.27609837; Build Number:27609831
• OS / Version = CentOS Linux release 7.7
• VM or physical PC = VM

nvidia-bug-report.log.gz

Guide used for installation: 430.83-432.33-grid-vgpu-user-guide.pdf

Steps for installation:

  1. NVIDIA Virtual GPU Manager Package for vSphere → done
  2. Verifying the Installation of the NVIDIA vGPU Software Package for vSphere → done
  3. Configuring VMware vMotion with vGPU for VMware vSphere → done
  4. Changing the Default Graphics Type in VMware vSphere 6.7 → done
  5. Configuring a vSphere VM with NVIDIA GPU → done

And now the problem:
After configuring the vSphere VM with the GPU, I started the CentOS 7.7 VM. Once it had booted, the NVIDIA driver installation failed with:
ERROR: Unable to load the ‘nvidia-drm’ kernel module.
On the VM I have checked or configured the following:
-> The NVIDIA graphics card model is displayed.
-> The nouveau driver is disabled: I added nouveau.modeset=0 to the line starting with GRUB_CMDLINE_LINUX in /etc/default/grub.
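
A quick way to confirm that the nouveau setting reached the running kernel is to look at the kernel command line. The Python sketch below is only an illustration; the helper names are made up here. Note that on CentOS 7 a change to /etc/default/grub normally takes effect only after the GRUB configuration has been regenerated and the VM rebooted.

```python
def has_kernel_param(cmdline, param):
    """Return True if `param` appears as a whole token on the kernel command line."""
    return param in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as handle:
        return handle.read()


# Hypothetical command line used for illustration; on the VM itself,
# pass read_cmdline() instead of this sample string.
sample = "BOOT_IMAGE=/vmlinuz ro quiet nouveau.modeset=0"
print(has_kernel_param(sample, "nouveau.modeset=0"))  # -> True
```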

ESXi 6.7 should handle that by itself.
Looking at the early dmesg output again, you’re not booting with EFI but with CSM (legacy BIOS). With legacy BIOS boot this is not going to work, because no 64-bit resources are available. Please properly configure and install your VM with EFI boot.
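
To check from inside the guest which way it booted: a Linux system started through EFI exposes the directory /sys/firmware/efi, and a legacy BIOS boot does not. A minimal Python sketch (the function name is just for illustration):

```python
import os


def booted_with_efi(sysfs_efi_dir="/sys/firmware/efi"):
    """Return True if the EFI sysfs directory exists, i.e. the kernel was started via EFI."""
    return os.path.isdir(sysfs_efi_dir)


if booted_with_efi():
    print("EFI boot")
else:
    print("legacy BIOS (CSM) boot, or not a Linux system")
```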

I reinstalled the VM with EFI boot and set firmware = "efi" and pciPassthru.64bitMMIOSizeGB = "128", and now everything is fine. Thanks again!
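
For anyone who wants to verify these two settings across several VMs, here is a rough Python sketch that reads the text of a .vmx file and reports which of the two values differ. The parser only handles simple key = "value" lines, the helper names are invented for this example, and the value 128 is simply what worked in this thread, so treat it as an assumption for other GPUs.

```python
import re

# Matches simple lines of the form: key = "value"
VMX_LINE = re.compile(r'^\s*(?P<key>[^#=\s]+)\s*=\s*"(?P<value>[^"]*)"\s*$')

# The two settings reported as the fix in this thread.
REQUIRED = {"firmware": "efi", "pciPassthru.64bitMMIOSizeGB": "128"}


def parse_vmx(text):
    """Parse key = "value" lines from .vmx text into a dict; other lines are ignored."""
    settings = {}
    for line in text.splitlines():
        match = VMX_LINE.match(line)
        if match:
            settings[match.group("key")] = match.group("value")
    return settings


def missing_settings(settings, required):
    """Return {key: (expected, actual)} for every required key that is absent or different."""
    return {
        key: (expected, settings.get(key))
        for key, expected in required.items()
        if settings.get(key) != expected
    }


sample = 'firmware = "efi"\npciPassthru.64bitMMIOSizeGB = "128"\n'
print(missing_settings(parse_vmx(sample), REQUIRED))  # -> {}
```

An empty result means both settings match; anything else lists the expected and the actual value for each mismatch.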