Cannot install NVIDIA driver in ESXi VM with vGPU

I’m attempting to create VMs on ESXi 6.7 where vGPU has been installed on the hypervisor. So far, it has all ended in tears.

I’ve successfully installed vGPU - as verified by running nvidia-smi after logging into the ESXi host via SSH. I used the latest vGPU software for ESXi 6.7, which installed NVIDIA driver 430.46.
The hardware is a Dell C4140 with four V100-PCIe cards. The BIOS setting for MMIO base is 12TB.
I have all the GPUs configured as ‘Shared Direct’.
The VM I created has 8 cores and 64 GB of memory (all reserved, of course). I selected an NVIDIA GRID vGPU as a Shared PCI device.
After installing the OS, running ‘lspci | grep -i nvid’ shows a single V100 (device 1db4).

Installing CUDA always results in a failure when the install script attempts to install the driver.

I’ve attempted this on SLES12 SP3 with the CUDA 9.1 installer, and SLES15 SP1 with the CUDA 10.1 installer.

I then attempted to install the driver directly using its standalone install script.

Have I missed a step? Am I using the wrong driver combo? What should I try next?

FYI - these same machines work perfectly when the GPUs are in passthrough mode (no vGPU) and I install the same OS/CUDA combinations in the VMs.

I’d be happy to provide the error messages if that would be helpful.

Thanks!

I’ve made some progress.

I was able to successfully install the GRID driver via NVIDIA-Linux-x86_64-430.46-grid.run. The trick is to run it before the CUDA installer, and then to install CUDA without its bundled driver.
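
For anyone following along, the sequence looks roughly like this (file names will differ for your downloads, and the --silent/--toolkit flags are just one way to tell the CUDA runfile installer to skip its bundled driver - check its --help first):

# 1) Install the GRID guest driver first
sudo sh NVIDIA-Linux-x86_64-430.46-grid.run

# 2) Then install the CUDA toolkit only, without the bundled driver
sudo sh cuda_&lt;version&gt;_linux.run --silent --toolkit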

This works for SLES12 SP3, but not SLES15 SP1.

The next problem is that none of the sample programs will run.

nvidia-smi produces the following output:

Fri Sep 20 16:38:13 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.46       Driver Version: 430.46       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID V100-8Q        On   | 00000000:02:01.0 Off |                  N/A |
| N/A   N/A    P0    N/A /  N/A |    528MiB /  8128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

However, when I run the vectorAdd sample, I get this:
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code all CUDA-capable devices are busy or unavailable)!

I ran a program that I wrote so that I could dig a little deeper. After initializing the driver, it is able to find the GPU. Here’s the output from my program:
********** GPU 0 Info *************
Device Name: GRID V100-8Q
Max Threads per Block:1024
Max X Dim per Block: 1024
Max Y Dim per Block: 1024
Max Z Dim per Block: 64
Max X Dim per Grid: 2147483647
Max Y Dim per Grid: 65535
Max Z Dim per Grid: 65535
Shared Mem/Block: 49152
SMX Count: 80
warp Size: 32
Compute Capability: 7.0
Global Memory Bytes: 4294967295
PCI Bus ID: 0x2
PCI Device ID: 0x1
Unified Addressing 1


Everything seems to be OK - except that when I try to create a CUDA context, it hangs. It consumes 100% of a CPU core, but never returns. My guess is that the CPU is being burned in some kind of polling loop. The driver call that never returns is cuCtxCreate().

Any ideas what to do next?

Missing license?
Without a license you won’t be able to run CUDA :)
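
You can check the license state from inside the VM with something like this (exact field names vary by driver branch):

nvidia-smi -q | grep -i -A2 licen     # look for the "License Status" field
journalctl -u nvidia-gridd | tail     # the license client logs here on systemd distros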

Regards
Simon

I downloaded the GRID Evaluation Edition from the NVIDIA Software Licensing Center and installed the binaries provided in the download. There was no step in the installation process that requested an activation key, so I assumed that a countdown timer was built into that edition. Since I’ve only been attempting this for a couple of weeks, I’m certain the trial period has not expired.
Is there an extra activation step that I missed?
Like I wrote in my previous post, the driver call hangs - I’m not getting an error return code.

Thanks!
Doug

I dug a little deeper using GDB and more printfs. I’m getting error code 999 (unknown error) back from cuCtxCreate(). Is that possibly a license problem?

What about reading our user guide? You definitely need a license, and every step is properly documented.

Apologies.
I had activation keys in hand, but had not yet used them. I was led down the primrose path because everything installed and I was able to run nvidia-smi in the VM. I read the sentence in the docs saying that performance is restricted without a license, but missed the one saying that CUDA is disabled until a license is provided.
I’ll get started on that. Sorry for wasting your time.
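
For anyone who hits this later, the client-side licensing setup inside the VM is roughly the following (paths and key names are from the vGPU guest driver package as I understand them - double-check the licensing section of the user guide for your release):

# Copy the template shipped with the GRID guest driver and edit it
sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf

# In /etc/nvidia/gridd.conf set:
#   ServerAddress=&lt;license-server-address&gt;
#   ServerPort=7070
#   FeatureType=1        # vGPU license (see the licensing guide for the right value)

# Restart the license client and confirm the GPU shows as licensed
sudo systemctl restart nvidia-gridd
nvidia-smi -q | grep -i -A2 licen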

One suggestion: add an ‘Invalid or Missing License’ error code.