GRID 3.0 installs successfully on ESXi 6.0.2 with M60 GPUs but fails to verify via nvidia-smi

We have a clean SuperMicro server with VMware ESXi 6.0.2 build 360759 installed. We entered "Maintenance Mode", installed the NVIDIA Host Driver following the steps in the latest guide docs, rebooted, and turned off "Maintenance Mode". Once the host came back up we SSH’d in to verify the install with various commands. Every check passed except nvidia-smi, which returns: Failed to initialize NVML: Unknown Error

NOTE: The same hardware worked properly in GRID 2.0

Hardware/Software list:

Supermicro Chassis 1028GQ-TRT
Dual Xeon E5-2600v3 2.60
256 GiB memory
(4) NVIDIA M60 cards installed and in graphics mode
(2) 480 GB SSD drives
ESXi 6.0.2 build 360759
NVIDIA-vGPU-VMware_ESXi_6.0_Host_Driver_361.40-1OEM.600.0.0.2494585.vib

Steps taken to install:

• Installed ESXi 6 on a clean system
• Enabled SSH
• No VMs or datastores set up yet
• Set up the clock on the server: pool.ntp.org via vSphere
• Checked "VMkernel.Boot.disableACSCheck" under Configuration/Software/Advanced Settings/VMkernel/Boot and clicked "OK"
• Entered Maintenance Mode
• Downloaded the GRID software from the NVIDIA License Center under Recent Product Releases from this link:
https://nvidia.flexnetoperations.com/control/nvda/viewRecentProductReleases
• Grabbed the April 4th release of GRID 3.0 for vSphere 6.0
• Copied NVIDIA-vGPU-VMware_ESXi_6.0_Host_Driver_361.40-1OEM.600.0.0.2494585.vib to the /tmp folder
• SSH’d into the server
• Ran: esxcli software vib install -v /tmp/NVIDIA-vGPU-VMware_ESXi_6.0_Host_Driver_361.40-1OEM.600.0.0.2494585.vib

Result:

Installation Result
Message: Operation finished successfully.
Reboot Required: false
VIBs Installed: NVIDIA_bootbank_NVIDIA-vGPU-VMware_ESXi_6.0_Host_Driver_361.40-1OEM.600.0.0.2494585
VIBs Removed:
VIBs Skipped:

• Rebooted the server and turned off Maintenance Mode
• SSH'd into the server and verified the install

Verify Results:

[root@localhost:~] esxcli software vib list | grep -i nvidia

NVIDIA-vGPU-VMware_ESXi_6.0_Host_Driver 361.40-1OEM.600.0.0.2494585 NVIDIA VMwareAccepted 2016-05-04

[root@localhost:~] vmkload_mod -l | grep nvidia

nvidia 0 10012

[root@localhost:~] esxcfg-module -l | grep nvidia

nvidia 0 10012

[root@localhost:~] nvidia-smi

Failed to initialize NVML: Unknown Error
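
For anyone repeating these checks, the results can be folded into one pass/fail summary. The following is a minimal Python sketch, not an NVIDIA or VMware tool: `summarize_checks` is a hypothetical helper that takes the text captured from each command and only looks for the strings shown in this post.

```python
def summarize_checks(vib_list_out, vmkload_out, smi_out):
    """Return pass/fail flags for the host-driver checks.

    Each argument is the raw text captured from one command on the host:
    `esxcli software vib list`, `vmkload_mod -l`, and `nvidia-smi`.
    """
    # The VIB counts as installed if any listed package mentions NVIDIA.
    vib_installed = any(
        "nvidia" in line.lower() for line in vib_list_out.splitlines()
    )
    # The kernel module counts as loaded if a line starts with "nvidia".
    module_loaded = any(
        line.split()[:1] == ["nvidia"] for line in vmkload_out.splitlines()
    )
    # nvidia-smi prints this message when it cannot talk to the driver.
    nvml_ok = bool(smi_out.strip()) and "Failed to initialize NVML" not in smi_out
    return {
        "vib_installed": vib_installed,
        "module_loaded": module_loaded,
        "nvml_ok": nvml_ok,
    }


# Output copied from the post above.
VIB_LIST = ("NVIDIA-vGPU-VMware_ESXi_6.0_Host_Driver "
            "361.40-1OEM.600.0.0.2494585 NVIDIA VMwareAccepted 2016-05-04")
VMKLOAD = "nvidia 0 10012"
SMI = "Failed to initialize NVML: Unknown Error"

result = summarize_checks(VIB_LIST, VMKLOAD, SMI)
print(result)
```

With the output above, the VIB and module checks pass and only the NVML check fails, which matches the symptom: the driver is installed and loaded but cannot initialize the GPUs.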

No passthrough is set up on ESXi; it’s just a clean server with no VMs or datastores.

If we go ahead and create VMs and use vCenter to add the M60 card, none of the profiles are listed.

We’ve gone through this several times on 2 other servers with different CPUs but the same software and M60 cards, with exactly the same results.

Please help.

Thanks! Alex

Hi Alex,

I’m afraid I’m not a VMware expert myself. But I’m checking for known issues with the support and product teams. You are entitled to full support with M60 and GRID 3.0 - have you raised a support case yet?

Best wishes,
Rachel

No I haven’t raised a support case. Should I go ahead and create one?

Yes, that would be good. Raise a support case and PM (personal message) me the number and I’ll keep an eye on it. One of our engineers is already looking into this. Please add my name to the ticket so frontline don’t have to chase info, and I’ll fill them in.

Rachel

At the ESXi host CLI please run

lspci -n | grep 10de

then post the result here.

Hi Jason, here are the results

[root@localhost:~] lspci -n | grep 10de
0000:04:00.0 Class 0300: 10de:13f2 [vmgfx6]
0000:05:00.0 Class 0300: 10de:13f2 [vmgfx7]
0000:08:00.0 Class 0300: 10de:13f2 [vmgfx4]
0000:09:00.0 Class 0300: 10de:13f2 [vmgfx5]
0000:83:00.0 Class 0300: 10de:13f2 [vmgfx2]
0000:84:00.0 Class 0300: 10de:13f2 [vmgfx3]
0000:87:00.0 Class 0300: 10de:13f2 [vmgfx0]
0000:88:00.0 Class 0300: 10de:13f2 [vmgfx1]
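
When reading a listing like this, the count is the first thing to check: vendor ID 10de is NVIDIA, and each M60 board exposes two GPUs, so four boards should show eight entries. A minimal Python sketch of that check follows. `nvidia_gpu_addresses` is a hypothetical helper, and the two-GPUs-per-board factor is an assumption about the M60 rather than something the tool reports.

```python
import re

# One line of `lspci -n` output as printed on the ESXi host, for example:
#   0000:04:00.0 Class 0300: 10de:13f2 [vmgfx6]
LSPCI_LINE = re.compile(
    r"^(?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+"
    r"Class\s+(?P<cls>[0-9a-f]{4}):\s+"
    r"(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})"
)

NVIDIA_VENDOR_ID = "10de"  # PCI vendor ID assigned to NVIDIA
GPUS_PER_BOARD = 2         # assumption: one M60 board exposes two GPUs


def nvidia_gpu_addresses(lspci_output):
    """Return the PCI addresses of all NVIDIA functions in the listing."""
    addresses = []
    for line in lspci_output.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if match and match.group("vendor") == NVIDIA_VENDOR_ID:
            addresses.append(match.group("addr"))
    return addresses


# Listing copied from the post above (4 boards installed).
LISTING = """\
0000:04:00.0 Class 0300: 10de:13f2 [vmgfx6]
0000:05:00.0 Class 0300: 10de:13f2 [vmgfx7]
0000:08:00.0 Class 0300: 10de:13f2 [vmgfx4]
0000:09:00.0 Class 0300: 10de:13f2 [vmgfx5]
0000:83:00.0 Class 0300: 10de:13f2 [vmgfx2]
0000:84:00.0 Class 0300: 10de:13f2 [vmgfx3]
0000:87:00.0 Class 0300: 10de:13f2 [vmgfx0]
0000:88:00.0 Class 0300: 10de:13f2 [vmgfx1]
"""

found = nvidia_gpu_addresses(LISTING)
expected = 4 * GPUS_PER_BOARD
print(f"found {len(found)} of {expected} expected GPUs")
```

Here all eight GPUs are enumerated on the PCI bus, so at this point the boards are visible to ESXi even though nvidia-smi cannot initialize them.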

Were the GPUs factory fitted?

These GPUs were provided by NVIDIA directly, as we are an NVIDIA partner, and we installed them ourselves. The same GPUs just successfully completed NVQUAL on this server.

This shows the M60 / M6 GPU is correctly set in graphics mode. Anyone else experiencing M60 / M6 issues can double-check this easily by following this advice: http://nvidia.custhelp.com/app/answers/detail/a_id/4106/. I’m afraid I haven’t got any suggestions for this case though.

More things I’ve tried

I took out 3 of the 4 GPUs and disabled "Above 4G Decoding" in the SuperMicro BIOS.

With (1) GPU installed and “Above 4G Decoding” disabled, nvidia-smi ran properly and returned:

[root@localhost:~] nvidia-smi
Thu May 5 19:18:01 2016
+------------------------------------------------------+
| NVIDIA-SMI 361.40     Driver Version: 361.40         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 0000:83:00.0     Off |                  Off |
| N/A   38C    P8    24W / 150W |     19MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           On   | 0000:84:00.0     Off |                  Off |
| N/A   33C    P8    24W / 150W |     19MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I then enabled “Above 4G Decoding”, left the (1) card in, and got “Failed to initialize NVML: Unknown Error” when running nvidia-smi.

I installed a 2nd GPU, disabled “Above 4G Decoding” and booted, and got the BIOS error “Insufficient PCI Resources Detected”. So the server won’t boot with more than (1) GPU installed unless “Above 4G Decoding” is enabled.

So I rebooted with “Above 4G Decoding” enabled, tried nvidia-smi, and got the same “Failed to initialize NVML: Unknown Error”.

Basically it only works with (1) card installed and “Above 4G Decoding” disabled.

Yes, ESXi still requires decoding below 4G.

Now we’re getting somewhere.

Are there any other PCI devices installed?

Are you at the latest BIOS revision for the chassis?

Have you checked the VMware HCL for this configuration?

Do you have the means to try a hypervisor that is not restricted to below-4G decoding? (XenServer is the easiest.)

After hearing back from SuperMicro and doing some tweaking here we now have 2 GPUs working. They suggested we change the MMIOHBase setting to “2T”. The setting is located under Advanced PCIe/PCI/PnP Configuration. We also changed the following in the BIOS:
https://dl.dropboxusercontent.com/u/4009063/PCIe_PCI_PnP_Part_2%20Changes.jpg

The above BIOS settings work with 3 cards, but when you add the 4th card only 3 are recognized. I tested that card on its own, with all the others removed, and it works. I just can’t get all 4 to be seen via nvidia-smi and lspci -n | grep 10de

[root@localhost:~] lspci -n | grep 10de
0000:04:00.0 Class 0300: 10de:13f2 [vmgfx4]
0000:05:00.0 Class 0300: 10de:13f2 [vmgfx5]
0000:83:00.0 Class 0300: 10de:13f2 [vmgfx2]
0000:84:00.0 Class 0300: 10de:13f2 [vmgfx3]
0000:87:00.0 Class 0300: 10de:13f2 [vmgfx0]
0000:88:00.0 Class 0300: 10de:13f2 [vmgfx1]

More fun stuff to figure out.

Here’s what I get with the 4 cards installed:

[root@localhost:~] lspci -nvv | egrep "^[a-f0-9]|Memory at"
0000:00:00.0 Class 0600: 8086:2f00 [PCIe RP[0000:00:00.0]]
0000:00:01.0 Class 0604: 8086:2f02 [PCIe RP[0000:00:01.0]]
0000:00:02.0 Class 0604: 8086:2f04 [PCIe RP[0000:00:02.0]]
0000:00:03.0 Class 0604: 8086:2f08 [PCIe RP[0000:00:03.0]]
0000:00:04.0 Class 0880: 8086:2f20 
0000:00:04.1 Class 0880: 8086:2f21 
0000:00:04.2 Class 0880: 8086:2f22 
0000:00:04.3 Class 0880: 8086:2f23 
0000:00:04.4 Class 0880: 8086:2f24 
0000:00:04.5 Class 0880: 8086:2f25 
0000:00:04.6 Class 0880: 8086:2f26 
0000:00:04.7 Class 0880: 8086:2f27 
0000:00:05.0 Class 0880: 8086:2f28 
0000:00:05.1 Class 0880: 8086:2f29 
0000:00:05.2 Class 0880: 8086:2f2a 
0000:00:05.4 Class 0800: 8086:2f2c 
0000:00:11.0 Class ff00: 8086:8d7c 
0000:00:11.4 Class 0106: 8086:8d62 [vmhba0]
0000:00:14.0 Class 0c03: 8086:8d31 
0000:00:16.0 Class 0780: 8086:8d3a 
0000:00:16.1 Class 0780: 8086:8d3b 
0000:00:1a.0 Class 0c03: 8086:8d2d 
0000:00:1c.0 Class 0604: 8086:8d10 [PCIe RP[0000:00:1c.0]]
0000:00:1c.4 Class 0604: 8086:8d18 [PCIe RP[0000:00:1c.4]]
0000:00:1d.0 Class 0c03: 8086:8d26 
0000:00:1f.0 Class 0601: 8086:8d44 
0000:00:1f.2 Class 0106: 8086:8d02 [vmhba1]
0000:00:1f.3 Class 0c05: 8086:8d22 
0000:02:00.0 Class 0604: 10b5:8747 
0000:03:08.0 Class 0604: 10b5:8747 
0000:03:10.0 Class 0604: 10b5:8747 
0000:04:00.0 Class 0300: 10de:13f2 [vmgfx4]
0000:05:00.0 Class 0300: 10de:13f2 [vmgfx5]
0000:07:00.0 Class 0200: 8086:1528 [vmnic0]
0000:07:00.1 Class 0200: 8086:1528 [vmnic1]
0000:08:00.0 Class 0604: 1a03:1150 
0000:09:00.0 Class 0300: 1a03:2000 
0000:7f:08.0 Class 0880: 8086:2f80 
0000:7f:08.2 Class 1101: 8086:2f32 
0000:7f:08.3 Class 0880: 8086:2f83 
0000:7f:09.0 Class 0880: 8086:2f90 
0000:7f:09.2 Class 1101: 8086:2f33 
0000:7f:09.3 Class 0880: 8086:2f93 
0000:7f:0b.0 Class 0880: 8086:2f81 
0000:7f:0b.1 Class 1101: 8086:2f36 
0000:7f:0b.2 Class 1101: 8086:2f37 
0000:7f:0c.0 Class 0880: 8086:2fe0 
0000:7f:0c.1 Class 0880: 8086:2fe1 
0000:7f:0c.2 Class 0880: 8086:2fe2 
0000:7f:0c.3 Class 0880: 8086:2fe3 
0000:7f:0c.4 Class 0880: 8086:2fe4 
0000:7f:0c.5 Class 0880: 8086:2fe5 
0000:7f:0c.6 Class 0880: 8086:2fe6 
0000:7f:0c.7 Class 0880: 8086:2fe7 
0000:7f:0d.0 Class 0880: 8086:2fe8 
0000:7f:0d.1 Class 0880: 8086:2fe9 
0000:7f:0d.2 Class 0880: 8086:2fea 
0000:7f:0d.3 Class 0880: 8086:2feb 
0000:7f:0d.4 Class 0880: 8086:2fec 
0000:7f:0d.5 Class 0880: 8086:2fed 
0000:7f:0f.0 Class 0880: 8086:2ff8 
0000:7f:0f.1 Class 0880: 8086:2ff9 
0000:7f:0f.2 Class 0880: 8086:2ffa 
0000:7f:0f.3 Class 0880: 8086:2ffb 
0000:7f:0f.4 Class 0880: 8086:2ffc 
0000:7f:0f.5 Class 0880: 8086:2ffd 
0000:7f:0f.6 Class 0880: 8086:2ffe 
0000:7f:10.0 Class 0880: 8086:2f1d 
0000:7f:10.1 Class 1101: 8086:2f34 
0000:7f:10.5 Class 0880: 8086:2f1e 
0000:7f:10.6 Class 1101: 8086:2f7d 
0000:7f:10.7 Class 0880: 8086:2f1f 
0000:7f:12.0 Class 0880: 8086:2fa0 
0000:7f:12.1 Class 1101: 8086:2f30 
0000:7f:12.4 Class 0880: 8086:2f60 
0000:7f:12.5 Class 1101: 8086:2f38 
0000:7f:13.0 Class 0880: 8086:2fa8 
0000:7f:13.1 Class 0880: 8086:2f71 
0000:7f:13.2 Class 0880: 8086:2faa 
0000:7f:13.3 Class 0880: 8086:2fab 
0000:7f:13.6 Class 0880: 8086:2fae 
0000:7f:13.7 Class 0880: 8086:2faf 
0000:7f:14.0 Class 0880: 8086:2fb0 
0000:7f:14.1 Class 0880: 8086:2fb1 
0000:7f:14.2 Class 0880: 8086:2fb2 
0000:7f:14.3 Class 0880: 8086:2fb3 
0000:7f:14.4 Class 0880: 8086:2fbc 
0000:7f:14.5 Class 0880: 8086:2fbd 
0000:7f:14.6 Class 0880: 8086:2fbe 
0000:7f:14.7 Class 0880: 8086:2fbf 
0000:7f:16.0 Class 0880: 8086:2f68 
0000:7f:16.1 Class 0880: 8086:2f79 
0000:7f:16.2 Class 0880: 8086:2f6a 
0000:7f:16.3 Class 0880: 8086:2f6b 
0000:7f:16.6 Class 0880: 8086:2f6e 
0000:7f:16.7 Class 0880: 8086:2f6f 
0000:7f:17.0 Class 0880: 8086:2fd0 
0000:7f:17.1 Class 0880: 8086:2fd1 
0000:7f:17.2 Class 0880: 8086:2fd2 
0000:7f:17.3 Class 0880: 8086:2fd3 
0000:7f:17.4 Class 0880: 8086:2fb8 
0000:7f:17.5 Class 0880: 8086:2fb9 
0000:7f:17.6 Class 0880: 8086:2fba 
0000:7f:17.7 Class 0880: 8086:2fbb 
0000:7f:1e.0 Class 0880: 8086:2f98 
0000:7f:1e.1 Class 0880: 8086:2f99 
0000:7f:1e.2 Class 0880: 8086:2f9a 
0000:7f:1e.3 Class 0880: 8086:2fc0 
0000:7f:1e.4 Class 0880: 8086:2f9c 
0000:7f:1f.0 Class 0880: 8086:2f88 
0000:7f:1f.2 Class 0880: 8086:2f8a 
0000:80:02.0 Class 0604: 8086:2f04 [PCIe RP[0000:80:02.0]]
0000:80:03.0 Class 0604: 8086:2f08 [PCIe RP[0000:80:03.0]]
0000:80:04.0 Class 0880: 8086:2f20 
0000:80:04.1 Class 0880: 8086:2f21 
0000:80:04.2 Class 0880: 8086:2f22 
0000:80:04.3 Class 0880: 8086:2f23 
0000:80:04.4 Class 0880: 8086:2f24 
0000:80:04.5 Class 0880: 8086:2f25 
0000:80:04.6 Class 0880: 8086:2f26 
0000:80:04.7 Class 0880: 8086:2f27 
0000:80:05.0 Class 0880: 8086:2f28 
0000:80:05.1 Class 0880: 8086:2f29 
0000:80:05.2 Class 0880: 8086:2f2a 
0000:80:05.4 Class 0800: 8086:2f2c 
0000:81:00.0 Class 0604: 10b5:8747 
0000:82:08.0 Class 0604: 10b5:8747 
0000:82:10.0 Class 0604: 10b5:8747 
0000:83:00.0 Class 0300: 10de:13f2 [vmgfx2]
0000:84:00.0 Class 0300: 10de:13f2 [vmgfx3]
0000:85:00.0 Class 0604: 10b5:8747 
0000:86:08.0 Class 0604: 10b5:8747 
0000:86:10.0 Class 0604: 10b5:8747 
0000:87:00.0 Class 0300: 10de:13f2 [vmgfx0]
0000:88:00.0 Class 0300: 10de:13f2 [vmgfx1]
0000:ff:08.0 Class 0880: 8086:2f80 
0000:ff:08.2 Class 1101: 8086:2f32 
0000:ff:08.3 Class 0880: 8086:2f83 
0000:ff:09.0 Class 0880: 8086:2f90 
0000:ff:09.2 Class 1101: 8086:2f33 
0000:ff:09.3 Class 0880: 8086:2f93 
0000:ff:0b.0 Class 0880: 8086:2f81 
0000:ff:0b.1 Class 1101: 8086:2f36 
0000:ff:0b.2 Class 1101: 8086:2f37 
0000:ff:0c.0 Class 0880: 8086:2fe0 
0000:ff:0c.1 Class 0880: 8086:2fe1 
0000:ff:0c.2 Class 0880: 8086:2fe2 
0000:ff:0c.3 Class 0880: 8086:2fe3 
0000:ff:0c.4 Class 0880: 8086:2fe4 
0000:ff:0c.5 Class 0880: 8086:2fe5 
0000:ff:0c.6 Class 0880: 8086:2fe6 
0000:ff:0c.7 Class 0880: 8086:2fe7 
0000:ff:0d.0 Class 0880: 8086:2fe8 
0000:ff:0d.1 Class 0880: 8086:2fe9 
0000:ff:0d.2 Class 0880: 8086:2fea 
0000:ff:0d.3 Class 0880: 8086:2feb 
0000:ff:0d.4 Class 0880: 8086:2fec 
0000:ff:0d.5 Class 0880: 8086:2fed 
0000:ff:0f.0 Class 0880: 8086:2ff8 
0000:ff:0f.1 Class 0880: 8086:2ff9 
0000:ff:0f.2 Class 0880: 8086:2ffa 
0000:ff:0f.3 Class 0880: 8086:2ffb 
0000:ff:0f.4 Class 0880: 8086:2ffc 
0000:ff:0f.5 Class 0880: 8086:2ffd 
0000:ff:0f.6 Class 0880: 8086:2ffe 
0000:ff:10.0 Class 0880: 8086:2f1d 
0000:ff:10.1 Class 1101: 8086:2f34 
0000:ff:10.5 Class 0880: 8086:2f1e 
0000:ff:10.6 Class 1101: 8086:2f7d 
0000:ff:10.7 Class 0880: 8086:2f1f 
0000:ff:12.0 Class 0880: 8086:2fa0 
0000:ff:12.1 Class 1101: 8086:2f30 
0000:ff:12.4 Class 0880: 8086:2f60 
0000:ff:12.5 Class 1101: 8086:2f38 
0000:ff:13.0 Class 0880: 8086:2fa8 
0000:ff:13.1 Class 0880: 8086:2f71 
0000:ff:13.2 Class 0880: 8086:2faa 
0000:ff:13.3 Class 0880: 8086:2fab 
0000:ff:13.6 Class 0880: 8086:2fae 
0000:ff:13.7 Class 0880: 8086:2faf 
0000:ff:14.0 Class 0880: 8086:2fb0 
0000:ff:14.1 Class 0880: 8086:2fb1 
0000:ff:14.2 Class 0880: 8086:2fb2 
0000:ff:14.3 Class 0880: 8086:2fb3 
0000:ff:14.4 Class 0880: 8086:2fbc 
0000:ff:14.5 Class 0880: 8086:2fbd 
0000:ff:14.6 Class 0880: 8086:2fbe 
0000:ff:14.7 Class 0880: 8086:2fbf 
0000:ff:16.0 Class 0880: 8086:2f68 
0000:ff:16.1 Class 0880: 8086:2f79 
0000:ff:16.2 Class 0880: 8086:2f6a 
0000:ff:16.3 Class 0880: 8086:2f6b 
0000:ff:16.6 Class 0880: 8086:2f6e 
0000:ff:16.7 Class 0880: 8086:2f6f 
0000:ff:17.0 Class 0880: 8086:2fd0 
0000:ff:17.1 Class 0880: 8086:2fd1 
0000:ff:17.2 Class 0880: 8086:2fd2 
0000:ff:17.3 Class 0880: 8086:2fd3 
0000:ff:17.4 Class 0880: 8086:2fb8 
0000:ff:17.5 Class 0880: 8086:2fb9 
0000:ff:17.6 Class 0880: 8086:2fba 
0000:ff:17.7 Class 0880: 8086:2fbb 
0000:ff:1e.0 Class 0880: 8086:2f98 
0000:ff:1e.1 Class 0880: 8086:2f99 
0000:ff:1e.2 Class 0880: 8086:2f9a 
0000:ff:1e.3 Class 0880: 8086:2fc0 
0000:ff:1e.4 Class 0880: 8086:2f9c 
0000:ff:1f.0 Class 0880: 8086:2f88 
0000:ff:1f.2 Class 0880: 8086:2f8a

[root@localhost:~] nvidia-smi
Fri May 6 20:03:28 2016
+------------------------------------------------------+
| NVIDIA-SMI 361.40     Driver Version: 361.40         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 0000:04:00.0     Off |                  Off |
| N/A   38C    P8    24W / 150W |     19MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           On   | 0000:05:00.0     Off |                  Off |
| N/A   34C    P8    23W / 150W |     19MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           On   | 0000:83:00.0     Off |                  Off |
| N/A   33C    P8    24W / 150W |     19MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           On   | 0000:84:00.0     Off |                  Off |
| N/A   30C    P8    23W / 150W |     19MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla M60           On   | 0000:87:00.0     Off |                  Off |
| N/A   33C    P8    25W / 150W |     19MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla M60           On   | 0000:88:00.0     Off |                  Off |
| N/A   30C    P8    23W / 150W |     19MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
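
Cross-checking the two listings above shows that nvidia-smi reports exactly the GPUs that lspci enumerates. That suggests the fourth board is being lost at PCI enumeration (BIOS resource allocation) rather than in the driver. A minimal Python sketch of the comparison, under the assumption that pulling anything shaped like a PCI address out of each output is good enough; `pci_addresses` is a hypothetical helper.

```python
import re

# Matches a full PCI address such as 0000:83:00.0 (domain:bus:device.function).
PCI_ADDR = re.compile(r"\b[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]\b", re.I)


def pci_addresses(text):
    """Return the set of PCI addresses that appear anywhere in the text."""
    return {match.group(0).lower() for match in PCI_ADDR.finditer(text)}


# NVIDIA (10de) lines from the lspci listing above, 4 boards installed.
LSPCI_NVIDIA = """\
0000:04:00.0 Class 0300: 10de:13f2 [vmgfx4]
0000:05:00.0 Class 0300: 10de:13f2 [vmgfx5]
0000:83:00.0 Class 0300: 10de:13f2 [vmgfx2]
0000:84:00.0 Class 0300: 10de:13f2 [vmgfx3]
0000:87:00.0 Class 0300: 10de:13f2 [vmgfx0]
0000:88:00.0 Class 0300: 10de:13f2 [vmgfx1]
"""

# GPU rows from the nvidia-smi output above.
SMI_ROWS = """\
| 0 Tesla M60 On | 0000:04:00.0 Off | Off |
| 1 Tesla M60 On | 0000:05:00.0 Off | Off |
| 2 Tesla M60 On | 0000:83:00.0 Off | Off |
| 3 Tesla M60 On | 0000:84:00.0 Off | Off |
| 4 Tesla M60 On | 0000:87:00.0 Off | Off |
| 5 Tesla M60 On | 0000:88:00.0 Off | Off |
"""

on_bus = pci_addresses(LSPCI_NVIDIA)
in_driver = pci_addresses(SMI_ROWS)

# GPUs the bus sees but the driver does not, and the reverse.
print("on bus only:", sorted(on_bus - in_driver))
print("in driver only:", sorted(in_driver - on_bus))
```

Both differences are empty here: six GPUs (three boards) on the bus and the same six in the driver.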

Thank you for taking the time to update everyone on Supermicro’s recommendations. Our support org will look to improve the documentation around BIOS MMIO requirements for hypervisors, so that your experience and time help improve the experience for others.

I’ll update the thread when support write this up.

Thank you,
Rachel

Summary of what has been learned from this case:

• Incorrect BIOS settings on a server when used with a hypervisor can cause MMIO address issues that result in GRID GPUs failing to be recognized.