ESXi 6.5 + Tesla M60 - Not working anymore after driver update

Hi and hello,

we have several XL190 Gen8 servers with Tesla M60 adapters running vSphere 6.5.
The cards have not been in use until now - we are preparing for a PoC.
The adapters were listed in the vSphere client and by nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.106                Driver Version: 367.106                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 0000:89:00.0     Off |                  Off |
| N/A   36C    P8    24W / 150W |     19MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           On   | 0000:8A:00.0     Off |                  Off |
| N/A   31C    P8    24W / 150W |     19MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

There were no signs that the cards were in compute mode; we also had a VM running and using a card.

Then we updated the driver to this version:
NVIDIA-kepler-VMware_ESXi_6.5_Host_Driver 367.128-1OEM.650.0.0.4598673

What we did was a procedure that worked well in our other datacenter:

  • Host -> maintenance mode

  • esxcli software vib remove -n NVIDIA-VMware_ESXi_6.5_Host_Driver
    Removal Result
    Message: Operation finished successfully.
    Reboot Required: false
    VIBs Installed:
    VIBs Removed: NVIDIA_bootbank_NVIDIA-VMware_ESXi_6.5_Host_Driver_367.106-1OEM.650.0.0.4598673
    VIBs Skipped:

  • reboot

  • installed new driver NVIDIA-kepler-VMware_ESXi_6.5_Host_Driver 367.128-1OEM.650.0.0.4598673

  • reboot
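
Spelled out as commands, the procedure looks roughly like the sketch below. It only prints the steps as a checklist, since the reboots would interrupt any script. The function name and the VIB path are placeholders for illustration, and the maintenance-mode and verification lines are my assumption of the command-line equivalent of what we did in the vSphere client.

```shell
#!/bin/sh
# Print the host-driver swap as a checklist of commands; nothing is executed.
# $1: name of the VIB to remove (taken from `esxcli software vib list`)
# $2: absolute path to the new .vib file (hypothetical placeholder below)
print_driver_swap_plan() {
    old_vib="$1"
    new_vib_path="$2"
    echo "esxcli system maintenanceMode set --enable true"
    echo "esxcli software vib remove -n ${old_vib}"
    echo "reboot"
    echo "esxcli software vib install -v ${new_vib_path}"
    echo "reboot"
    # Verification steps after the second reboot
    echo "esxcli software vib list | grep -i nvidia"
    echo "nvidia-smi"
    echo "esxcli system maintenanceMode set --enable false"
}

# Example call; the datastore path is made up for illustration.
print_driver_swap_plan "NVIDIA-VMware_ESXi_6.5_Host_Driver" \
    "/vmfs/volumes/datastore1/new-host-driver.vib"
```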

But after that:

[root@VI:~] nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running

[root@VI:~] esxcli software vib list | grep -i nvidia
NVIDIA-kepler-VMware_ESXi_6.5_Host_Driver 367.128-1OEM.650.0.0.4598673 NVIDIA VMwareAccepted 2018-12-13

In the vSphere client, the "Graphics Adapter" entry has changed from "NVIDIA Tesla M60" to "GM204GL [Tesla M60]".

[root@VI:~] lspci -n | grep 10de
0000:89:00.0 Class 0300: 10de:13f2
0000:8a:00.0 Class 0300: 10de:13f2

This seems to show that the card is still in graphics mode, IIRC.
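
As a rough way to read that output, here is a small sketch that labels each NVIDIA device by its PCI class code. It assumes that class 0300 (VGA controller) corresponds to graphics mode and class 0302 (3D controller) to compute mode; the function name is made up for illustration.

```shell
#!/bin/sh
# Label NVIDIA devices (vendor ID 10de) from `lspci -n` output on stdin.
# Assumption: class 0300 = graphics mode, class 0302 = compute mode.
classify_gpu_mode() {
    grep -i '10de:' | while read -r addr label class ids; do
        # The class field arrives as "0300:", so strip the trailing colon.
        case "${class%:}" in
            0300) echo "$addr graphics" ;;
            0302) echo "$addr compute" ;;
            *)    echo "$addr unknown" ;;
        esac
    done
}

# Example with the two lines shown above (on the host: lspci -n | classify_gpu_mode)
printf '%s\n' \
    '0000:89:00.0 Class 0300: 10de:13f2' \
    '0000:8a:00.0 Class 0300: 10de:13f2' | classify_gpu_mode
# prints:
# 0000:89:00.0 graphics
# 0000:8a:00.0 graphics
```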

Please help!

Kind regards
ZPPO

Here is your issue! You cannot run the old Kepler host driver on Maxwell boards!
vGPU requires software licenses, so the correct drivers are only available in the GRID license portal, where you need to create an account first:
nvidia.com/grideval
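
A quick way to spot this on a host might look like the sketch below. It simply checks whether the installed NVIDIA VIB name contains "kepler", which is how the package in this thread is named; the function name is hypothetical and the check is only a naming heuristic, not an official compatibility test.

```shell
#!/bin/sh
# Flag installed NVIDIA host-driver VIBs whose name marks them as Kepler packages.
# Reads `esxcli software vib list` output on stdin.
check_nvidia_vib() {
    grep -i nvidia | while read -r name version rest; do
        case "$name" in
            *[Kk]epler*) echo "WARNING: $name $version is a Kepler (GRID K1/K2) package" ;;
            *)           echo "OK: $name $version" ;;
        esac
    done
}

# Example with the line from the post above
# (on the host: esxcli software vib list | check_nvidia_vib)
printf '%s\n' \
    'NVIDIA-kepler-VMware_ESXi_6.5_Host_Driver 367.128-1OEM.650.0.0.4598673 NVIDIA VMwareAccepted 2018-12-13' \
    | check_nvidia_vib
```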

regards

Simon

Hi Simon,

thanks for your reply!

The driver’s release notes, included in
NVIDIA-vGPU-kepler-vSphere-6.5-367.128-370.28.zip,
tell me it should work, and they even contain a section covering the M60:

"GRID SOFTWARE FOR VMWARE
VSPHERE VERSION 367.128/370.28
RN-07347-001 _v4.7 | July 2018
Release Notes

2.1. Supported NVIDIA GPUs and Validated Server
Platforms
This release of NVIDIA GRID software provides support for the following NVIDIA
GPUs on VMware vSphere, running on validated server hardware platforms:
• GRID K1
• GRID K2
• Tesla M6
• Tesla M10
• Tesla M60
For a list of validated server platforms, refer to NVIDIA GRID Certified Servers.
Tesla M60 and M6 GPUs support compute mode and graphics mode. GRID vGPU
requires GPUs that support both modes to operate in graphics mode.
Recent Tesla M60 GPUs and M6 GPUs are supplied in graphics mode. However, your
GPU might be in compute mode if it is an older Tesla M60 GPU or M6 GPU, or if its
mode has previously been changed.
To configure the mode of Tesla M60 and M6 GPUs, use the gpumodeswitch tool
provided with GRID software releases…"

So why would this info be incorrect?

Regards,
ZPPO

Well,
you shouldn’t mix up different things. The same driver is available for Kepler and Maxwell (but only in the GRID license portal). I assume they just didn’t make the effort to create a separate release note for Kepler only, as it is stated very clearly on the public download page that the driver you downloaded is only for GRID K1/K2.
The release notes above describe the R367 branch, which certainly also supports Maxwell GPUs.
Once again:
vGPU starting with Maxwell generation requires software licensing and therefore you need to create an account in our GRID licensing portal in order to download the appropriate software!

Regards

Simon