Cannot install NVIDIA driver in ESXi VM with vGPU

I’m attempting to create VMs on ESXi 6.7 where vGPU has been installed on the hypervisor. So far, it has all ended in tears.

I’ve successfully installed vGPU - as verified by running nvidia-smi after logging into the ESXi host via SSH. I used the latest vGPU software for ESXi 6.7, which installed NVIDIA driver 430.46.
The hardware is a Dell C4140 with four V100-PCIe cards. The BIOS setting for MMIO base is 12TB.
I have all the GPUs configured as ‘Shared Direct’.
The VM I created has 8 cores and 64 GB of memory (all reserved, of course). I selected an NVIDIA GRID vGPU as a Shared PCI device.
After installing the OS, running ‘lspci | grep -i nvid’ shows a single V100 (device 1db4).

Installing CUDA always results in a failure when the install script attempts to install the driver.

I’ve attempted this on SLES12 SP3 with the CUDA 9.1 installer, and SLES15 SP1 with the CUDA 10.1 installer.

I then attempted to install the driver directly using its standalone install script.

Have I missed a step? Am I using the wrong driver combo? What should I try next?

FYI - these same machines work perfectly when the GPUs are in passthrough mode (no vGPU) and I install the same OS/CUDA combinations in the VMs.

I’d be happy to provide the error messages if that would be helpful.

Thanks!

I’ve made some progress.

I was able to successfully install the GRID driver via NVIDIA-Linux-x86_64-430.46-grid.run. The trick is to run it before the CUDA installer, and then to install CUDA without its bundled driver.
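
For anyone following along, the sequence looks roughly like this (file names will differ for your downloads, and the --silent/--toolkit flags are just one way to tell the CUDA runfile installer to skip its bundled driver - check its --help first):

# 1) Install the GRID guest driver first
sudo sh NVIDIA-Linux-x86_64-430.46-grid.run

# 2) Then install the CUDA toolkit only, without the bundled driver
sudo sh cuda_&lt;version&gt;_linux.run --silent --toolkit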

This works for SLES12 SP3, but not SLES15 SP1.

The next problem is that none of the sample programs will run.

nvidia-smi produces the following output:

Fri Sep 20 16:38:13 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.46       Driver Version: 430.46       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID V100-8Q        On   | 00000000:02:01.0 Off |                  N/A |
| N/A   N/A    P0    N/A /  N/A |    528MiB /  8128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

However, when I run the vectorAdd sample, I get this:
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code all CUDA-capable devices are busy or unavailable)!

I ran a program that I wrote so that I could dig a little deeper. After initializing the driver, it is able to find the GPU. Here’s the output from my program:
********** GPU 0 Info *************
Device Name: GRID V100-8Q
Max Threads per Block:1024
Max X Dim per Block: 1024
Max Y Dim per Block: 1024
Max Z Dim per Block: 64
Max X Dim per Grid: 2147483647
Max Y Dim per Grid: 65535
Max Z Dim per Grid: 65535
Shared Mem/Block: 49152
SMX Count: 80
warp Size: 32
Compute Capability: 7.0
Global Memory Bytes: 4294967295
PCI Bus ID: 0x2
PCI Device ID: 0x1
Unified Addressing 1


Everything seems to be OK - except that when I try to create a CUDA context, it hangs. It consumes 100% of a CPU core, but never returns. My guess is that the CPU is being burned in some kind of polling loop. The driver call that never returns is cuCtxCreate().

Any ideas what to do next?

Missing license?
Without a license you won’t be able to run CUDA :)
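
You can check the license state from inside the VM with something like this (exact field names vary by driver branch):

nvidia-smi -q | grep -i -A2 licen     # look for the "License Status" field
journalctl -u nvidia-gridd | tail     # the license client logs here on systemd distros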

Regards
Simon

I downloaded the GRID Evaluation Edition from the NVIDIA Software Licensing Center and installed the binaries provided in the download. There was no step in the installation process that requested an activation key, so I assumed that a countdown timer was built into that edition. Since I’ve only been attempting this for a couple of weeks, I’m certain the trial period has not expired.
Is there an extra activation step that I missed?
Like I wrote in my previous post, the driver call hangs - I’m not getting an error return code.

Thanks!
Doug

I dug a little deeper using GDB and more printfs. I’m getting error code 999 (unknown error) back from cuCtxCreate(). Is that possibly a license problem?

What about reading our user guide? You definitely need a license, and every step is properly documented.

Apologies.
I had activation keys in hand, but had not yet used them. I was led down the primrose path because everything installed and I was able to run nvidia-smi in the VM. I read the sentence in the docs saying that performance is restricted without a license, but missed the one saying that CUDA is disabled until a license is provided.
I’ll get started on that. Sorry for wasting your time.
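
For anyone who hits this later, the client-side licensing setup inside the VM is roughly the following (paths and key names are from the vGPU guest driver package as I understand them - double-check the licensing section of the user guide for your release):

# Copy the template shipped with the GRID guest driver and edit it
sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf

# In /etc/nvidia/gridd.conf set:
#   ServerAddress=&lt;license-server-address&gt;
#   ServerPort=7070
#   FeatureType=1        # vGPU license (see the licensing guide for the right value)

# Restart the license client and confirm the GPU shows as licensed
sudo systemctl restart nvidia-gridd
nvidia-smi -q | grep -i -A2 licen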

One suggestion: add an ‘Invalid or Missing License’ error code.