I’m attempting to create VMs on ESXi 6.7 hosts that have NVIDIA vGPU installed on the hypervisor. So far, it has all ended in tears.
I’ve successfully installed vGPU, as verified by running nvidia-smi after logging into the ESXi host via SSH. I used the latest vGPU release for ESXi 6.7, which installed NVIDIA driver 430.46.
The hardware is a Dell C4140 with four V100-PCIe cards. The BIOS setting for MMIO base is 12TB.
I have all the GPUs configured as ‘Shared Direct’.
The VM I created has 8 cores and 64 GB of memory (all reserved, of course). I selected an NVIDIA GRID vGPU as a Shared PCI Device.
After installing the guest OS, the command ‘lspci | grep -i nvid’ shows that I have a single V100 (device 1db4).
Installing CUDA always fails at the step where the install script attempts to install the bundled driver.
I’ve tried this on SLES12 SP3 with the CUDA 9.1 installer, and on SLES15 SP1 with the CUDA 10.1 installer.
I’ve also attempted to install the driver directly using its standalone install script.
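For reference, here is roughly the sequence I run in the guest, plus the checks I do after a failure. The runfile names are from memory, so treat them as placeholders, and each step is guarded so the snippet is safe to paste as-is:

```shell
# Placeholder runfile names -- substitute whatever you actually downloaded.
CUDA_RUNFILE="cuda_10.1.243_418.87.00_linux.run"       # CUDA 10.1 installer (assumed name)
DRIVER_RUNFILE="NVIDIA-Linux-x86_64-430.46-grid.run"   # standalone driver (assumed name)

# CUDA toolkit installer -- this is the step that fails at the driver phase:
if [ -f "$CUDA_RUNFILE" ]; then
    sudo sh "$CUDA_RUNFILE"
fi

# Standalone driver installer, tried directly after the CUDA failure:
if [ -f "$DRIVER_RUNFILE" ]; then
    sudo sh "$DRIVER_RUNFILE"
fi

# Post-failure checks: is the vGPU device still visible, did an nvidia
# module actually bind, and which installer logs hold the error details?
command -v lspci >/dev/null 2>&1 && { lspci | grep -i nvidia || true; }
command -v lsmod >/dev/null 2>&1 && { lsmod | grep '^nvidia' || true; }
for log in /var/log/nvidia-installer.log /var/log/cuda-installer.log; do
    [ -f "$log" ] && echo "log present: $log"
done
echo "checks complete"
```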
Have I missed a step? Am I using the wrong driver combo? What should I try next?
FYI: these same machines work perfectly when the GPUs are in passthrough mode without vGPU, and I install these same OS/CUDA combinations on the VMs.
I’d be happy to provide the error messages if that would be helpful.