Issue with Tesla M6 on Cisco B200 M4 after reboot

Hello,

I have a Tesla M6 GPU (Cisco) installed in a B200 M4 blade, and for some reason this setup stopped working after a reboot.

I can no longer power on VMs on ESXi 6.0U3.

Error in VMware:
Failed to start the virtual machine.
Module DevicePowerOn power on failed.
Could not initialize plugin '/usr/lib64/vmware/plugin/libnvidia-vgx.so' for vGPU 'grid_m6-4q'.

nvidia-smi shows everything is OK:
nvidia-smi
Wed Oct 11 15:59:12 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.73                 Driver Version: 384.73                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M6            On   | 00000000:81:00.0 Off |                    0 |
| N/A   48C    P8    16W / 100W |     13MiB /  7679MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

dmesg | grep -E "NVRM|nvidia"
2017-10-11T15:51:09.402Z cpu15:33461)Loading module nvidia ...
2017-10-11T15:51:09.410Z cpu15:33461)Elf: 1865: module nvidia has license NVIDIA
2017-10-11T15:51:09.520Z cpu15:33461)NVRM: vmk_MemPoolCreate passed for 4194304 pages.
2017-10-11T15:51:09.771Z cpu15:33461)NVRM: loading NVIDIA UNIX x86_64 Kernel Module 384.73 Mon Aug 21 15:16:25 PDT 2017
2017-10-11T15:51:09.771Z cpu15:33461)Device: 191: Registered driver 'nvidia' from 20
2017-10-11T15:51:09.772Z cpu15:33461)Mod: 4943: Initialization of nvidia succeeded with module ID 20.
2017-10-11T15:51:09.772Z cpu15:33461)nvidia loaded successfully.
2017-10-11T15:51:10.712Z cpu30:33460)Device: 326: Found driver nvidia for device 0x590a4304eae944b8
2017-10-11T15:51:10.714Z cpu17:33479)NVRM: nvidia_associate vmgfx0
2017-10-11T15:52:27.278Z cpu3:35450)IntrCookie: 1935: cookie 0x35 moduleID 20 <nvidia> exclusive, flags 0x1d

I tried to find a solution, but found only one VMware article, about changing the xorg service, and that service is running fine on the host.

/etc/init.d/xorg status
Xorg is running

I can see the configuration in the Web Client, and everything seems to be OK in the VM configuration.

Does anyone have any idea about this issue?

I have opened a ticket with Cisco support, but I would also like to get some ideas from people on the NVIDIA forum.

Thanks.

Hi

Is this a new installation or a production one that’s stopped working?

Can you check in vCenter, under the "Host Graphics" tab, that the "Default Graphics Type" is still "Shared Direct"? Also check under the "Graphics Devices" tab that "Active Type" and "Configured Type" both show "Shared Direct".

If the above is correct, then I’d be tempted to uninstall the existing .vib, reboot and reinstall the .vib, as that’s nice and quick and non-destructive.

Regards

Hi BJoned,

This is a production host that stopped working. I think your suggestion applies to 6.5, not to 6.0U3.

For some reason I can’t change the blade’s BIOS setting "Memory Mapped IO above 4GB Config" to Disabled. If I remove the card from the blade, the setting changes to Disabled, but once the card is back in the blade it is set to Enabled automatically and can’t be changed to Disabled.

Yes, sorry, that’s for 6.5. I haven’t used 6.0 since 6.5 was released and forgot that setting is not part of 6.0.

That’s fine, MMIO above 4G should be "Enabled".

If all you’ve done is reboot the server, then I’d still try the uninstall / re-install of the .vib and see if that helps. If you’ve applied some updates or made changes to the system somewhere, then obviously review those again …
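For reference, the uninstall / reboot / re-install cycle could look roughly like the sketch below. It only prints the commands rather than running them, so it is safe to try anywhere. The VIB name and the datastore path are placeholders, not values from this thread; the real name should come from `esxcli software vib list | grep -i nvidia` on the host. Putting the host in maintenance mode first and rebooting again after the install are assumptions based on the usual ESXi driver workflow.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the commands for a VIB
# remove / reboot / reinstall cycle on an ESXi host.
# Both arguments are placeholders supplied by the caller.

vib_reinstall_plan() {
    vib_name=$1   # name as listed by `esxcli software vib list`
    vib_path=$2   # absolute path to the .vib file on the host
    if [ -z "$vib_name" ] || [ -z "$vib_path" ]; then
        echo "usage: vib_reinstall_plan VIB_NAME /full/path/to/driver.vib" >&2
        return 1
    fi
    cat <<EOF
esxcli system maintenanceMode set --enable true
esxcli software vib remove -n $vib_name
reboot
esxcli software vib install -v $vib_path
reboot
esxcli system maintenanceMode set --enable false
EOF
}

# Example with obviously fake values; replace both before using the output.
vib_reinstall_plan "PLACEHOLDER_NVIDIA_VIB_NAME" "/vmfs/volumes/datastore1/PLACEHOLDER.vib"
```

The printed lines are meant to be reviewed and then typed one at a time in the ESXi shell, since each reboot ends the session anyway.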

Regards

OK, I resolved the issue. For some reason :-) the GPU mode had changed to Compute, even though I got the output below before switching it back to Graphics.
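In case it helps someone hitting the same error, here is a small sketch of the two pieces involved: pulling the reported mode out of `gpumodeswitch --listgpumodes`, and printing (not running) the commands to set graphics mode again. The function names are made up for illustration. The reboot after the switch is an assumption based on how mode changes normally take effect, and the sketch also assumes gpumodeswitch is already usable on the host. Note that in this case the reported mode already said Graphics before the fix, so the reported value alone is not conclusive.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, nothing here touches the GPU.

# Read `gpumodeswitch --listgpumodes` output on stdin and print the value
# of every "GPU Mode:" line (one line per adapter).
reported_gpu_mode() {
    sed -n 's/^[[:space:]]*GPU Mode:[[:space:]]*//p'
}

# Print the commands for setting a mode; defaults to graphics.
gpumode_switch_plan() {
    mode=${1:-graphics}
    case "$mode" in
        graphics|compute) ;;
        *) echo "mode must be 'graphics' or 'compute'" >&2; return 1 ;;
    esac
    cat <<EOF
gpumodeswitch --listgpumodes
gpumodeswitch --gpumode $mode
reboot
gpumodeswitch --listgpumodes
EOF
}

gpumode_switch_plan graphics
```

On the host, the first helper would be fed real output, for example `gpumodeswitch --listgpumodes | reported_gpu_mode`.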


Before reflashing the GPU to Graphics mode:
gpumodeswitch --listgpumodes

NVIDIA GPU Mode Switch Utility Version 1.23.0
Copyright (C) 2015, NVIDIA Corporation. All Rights Reserved.

Tesla M6 (10DE,13F3,10DE,1143) H:--:NRM S:00,B:81,PCI,D:00,F:00
Adapter: Tesla M6 (10DE,13F3,10DE,1143) H:--:NRM S:00,B:81,PCI,D:00,F:00

Identifying EEPROM...
EEPROM ID (EF,3013) : WBond W25X40A 2.7-3.6V 4096Kx1S, page
GPU Mode: Graphics