Supported Servers with Tesla M60 & ESXi 6.0 (Dell PowerEdge R720xd)

I want to confirm or find a list of supported servers for the Tesla M60 card with ESXi 6.0. I’ve installed the 352.54 VIB on a Dell PowerEdge R720 and am getting the "nvidia-smi has failed because it couldn’t communicate with the nvidia driver. make sure that the latest nvidia driver is installed and running." message. Using vmkload_mod -l, I’ve confirmed that the driver does not appear to be loading either. I’ve checked dmesg but am not seeing much, though I’m unsure what to look for to indicate an error. I’ve also checked whether the card shows up using "lspci | grep -i vga" and other variations, but do not see the card, which is why I suspect that an R720 will not work.
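
For reference, the checks I ran were along these lines (the grep patterns are only examples; as far as I can tell the M60 enumerates as a 3D/display controller rather than a VGA device, so a plain "vga" grep can miss it even when the host sees the card):

lspci | grep -i nvidia                       # look for the card on the PCI bus by vendor name
esxcli hardware pci list | grep -i nvidia    # same check through esxcli
vmkload_mod -l | grep -i nvidia              # confirm whether the nvidia module is loaded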

The driver installs fine, and I’ve gone through the necessary steps of entering maintenance mode, installing, rebooting, and exiting maintenance mode multiple times now, to no avail. I have an R730 that I can try next, but I want to validate that it is worth the effort first.
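
The sequence I have been repeating is roughly the following (the VIB path and filename are illustrative; substitute wherever you copied the 352.54 package):

vim-cmd hostsvc/maintenance_mode_enter
esxcli software vib install -v /tmp/NVIDIA-vGPU-VMware_ESXi_6.0_Host_Driver-352.54.vib    # illustrative path/filename
reboot
# after the host comes back up:
vim-cmd hostsvc/maintenance_mode_exit
esxcli software vib list | grep -i nvidia    # the VIB shows as installed, but the module never loads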

I’m hoping it’s a misconfiguration, or that I’ve missed a cable connection in the chassis, as I really would like to get this R720 to work. Please let me know.

You can filter the list of certified servers by card type.

The R720 is not certified by Dell; the R730 is, as long as you have the relevant PSUs, power cables, etc. Best to check with Dell on the specific requirements to retrofit the card.

Jason,

Thank you for that. That is extremely helpful.

Now for the next part: is there a difference in supported R730s, i.e. an R730xd versus an R730? You have listed an R730, and I can get my hands on one of those in the future, but I have an R730xd now that appears to be exhibiting the same behavior. Would some form of logs be useful?

You should check with Dell. It’s possible that it’s simply a BIOS issue or may well not be supported in the xd chassis.

Do you have the enablement kit for the R730, including power cables and the relevant PSUs?

Could you direct me to information on the "enablement kit" ?

You need to speak to Dell.

Most servers don’t ship with the required PSUs, PCIe risers, cables, etc., and some may require modified heatsinks or airflow baffles. Each OEM has a different set of additional components that may be required for a retrofit. In some cases it’s just a power cable; in others it’s a complete set of PSUs, risers, baffles, heatsinks and cables, so what you need to acquire depends on what you already have.

The OEM (in your case Dell) are the best people to ask for the details of what you require.

JRR,

Did you have any success in getting the M60 to work on the R720xd? I have a R720 and I am experiencing the same problem.

Thanks

David

I know that for the R720xd Dell chose not to certify, whereas they did certify the R720 - the R720xd has some extra room for storage, which makes everything else a bit more squashed and affected the thermal cooling, IIRC…

For anyone with a new M60 - I would strongly advise checking that it is in graphics and not compute mode, as per: Having problems with new M6/M60 like VMs fail to power on, NVRM BAR1 error, ECC is enabled, or nvidia-smi fails | NVIDIA
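
If you have shell access to the host (or boot the gpumodeswitch ISO), checking and changing the mode is roughly this, per the gpumodeswitch user guide (run it with no VMs using the GPU, and reboot afterwards):

gpumodeswitch --listgpumodes         # shows whether each GPU is in compute or graphics mode
gpumodeswitch --gpumode graphics     # switches all GPUs to graphics mode (confirm when prompted)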

You might want to search the KB database for other reasons nvidia-smi fails: Incorrect BIOS settings on a server when used with a hypervisor can cause MMIO address issues that result in GRID GPUs failing to be recognized. | NVIDIA
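
It is also worth grepping the host logs for the driver’s own messages - the kernel module logs with an "NVRM" prefix, so something like the following usually surfaces BAR1/MMIO complaints if that is what is going on (log path is the ESXi default):

dmesg | grep -i nvrm
grep -i nvrm /var/log/vmkernel.log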

BUT as you are dealing with a possibly unsupported server, I think, as Jason suggests, you really need to talk to Dell and your hypervisor vendor, as you could be left unsupported even if it works.

Hi Rachel,

I was able to install the gpumodeswitch VIB on my ESXi 6.0 U2 host and successfully change the mode over to graphics. Oddly, gpumodeswitch is able to see the cards. After the mode was changed I installed the software and rebooted the system, but I get nothing when I run "vmkload_mod -l | grep nvidia".
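
In case more detail helps, these are the sorts of checks I can run and send output from (package names are whatever esxcli reports on my host):

esxcli software vib list | grep -i nvidia    # the vGPU Manager VIB is reported as installed
vmkload_mod nvidia                           # attempt to load the module by hand
tail -n 50 /var/log/vmkernel.log             # look for NVRM errors from the failed load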

David

As it is uncertified I think you need to go back to the server OEM and talk to them. I’m afraid with uncertified configurations this can happen.

Hi All,

I have two questions. I was able to install the GRID 3.0 VIB (NVIDIA-vGPU-VMware_ESXi_6.0_Host_Driver_361.40-1OEM.600.0.0.2494585.vib) into ESXi 6.0 U2 with no issue, and everything came up properly. However, after the installation, the guide said that I should use gpumodeswitch to switch modes.

Interestingly, the instruction in the gpumodeswitch doc said to remove any NVIDIA drivers - which was a bit weird. But I did that.

I tried to install the modeswitch VIB (NVIDIA-GpuModeSwitch-1OEM.600.0.0.2494585.x86_64.vib) and it gave me an InstallationError, saying the VIB does not contain a signature. I lowered the acceptance level to community supported, but still no luck installing it.
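
For reference, this is roughly the sequence I used (the /tmp path is just where I copied the file onto the host):

esxcli software acceptance get               # check the current acceptance level
esxcli software acceptance set --level=CommunitySupported
esxcli software vib install -v /tmp/NVIDIA-GpuModeSwitch-1OEM.600.0.0.2494585.x86_64.vib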

Any thoughts?

Thanks
Segreen

Did you follow the process in the documentation exactly?

  1. Put the ESXi host into maintenance mode.

vim-cmd hostsvc/maintenance_mode_enter

  2. If an NVIDIA driver is already installed on the ESXi host, remove the driver.
    a) Get the name of the VIB package that contains the NVIDIA driver.

esxcli software vib list | grep -i nvidia

    b) Remove the VIB package that contains the NVIDIA driver.

esxcli software vib remove -n NVIDIA-driver-package

NVIDIA-driver-package is the VIB package name that you got in the previous step.

  3. Run the esxcli command to install the VIB.

esxcli software vib install -v /NVIDIA-GpuModeSwitch-1OEM.600.0.0.2494585.x86_64.vib

  4. Take the host out of maintenance mode.

vim-cmd hostsvc/maintenance_mode_exit

  5. Reboot the ESXi host.

There are several versions of the modeswitch utility and the .vib version does require the removal of the vGPU Manager.

I personally would not recommend using the .vib version unless you are unable to use the bootable ISO tool via the host’s remote management software. The reason is that the ISO is a much simpler tool to work with and only requires 2 reboots: one to boot into the ISO and one to switch back to the hypervisor.

Using the .vib you need to (see the command sketch further below):

remove vGPU manager
restart
install mode switch .vib
restart
switch mode
restart
remove modeswitch .vib
restart
install vGPU .vib
restart

That’s 4 extra restarts to allow you to stay within ESXi. I find it so much faster to simply boot to the ISO.
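
For anyone who does go the .vib route anyway, the cycle listed above looks roughly like this (bracket each install/remove with maintenance mode as per the docs, reboot where noted, and confirm the exact package names with "esxcli software vib list" first - NVIDIA-driver-package and NVIDIA-GpuModeSwitch-package below are placeholders, as in the earlier steps):

esxcli software vib list | grep -i nvidia                  # note the installed package names
esxcli software vib remove -n NVIDIA-driver-package        # remove the vGPU Manager, then reboot
esxcli software vib install -v /NVIDIA-GpuModeSwitch-1OEM.600.0.0.2494585.x86_64.vib    # then reboot
gpumodeswitch --gpumode graphics                           # switch the mode, then reboot
esxcli software vib remove -n NVIDIA-GpuModeSwitch-package    # remove the mode switch VIB, then reboot
esxcli software vib install -v /NVIDIA-vGPU-VMware_ESXi_6.0_Host_Driver_361.40-1OEM.600.0.0.2494585.vib    # reinstall the vGPU Manager, then reboot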