I want to confirm or find a list of supported servers for the Tesla M60 card with ESXi 6.0. I've installed the 352.54 vib on a Dell PowerEdge R720 and am getting the "nvidia-smi has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running." message. Using vmkload_mod -l, I've confirmed that the driver does not appear to be starting either. I've checked dmesg, but I'm not sure what to look for to indicate an error and am not seeing much. I've also checked whether the card shows up using "lspci | grep -i vga" and other variations, but I do not see the card, which is why I suspect that an R720 will not work.
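For reference, these are the checks I've been running from the ESXi shell. The broader lspci grep is an extra check, since as I understand it Tesla boards can enumerate as a 3D controller rather than a VGA device, so filtering on "vga" alone may miss them:

# is the NVIDIA driver module loaded?
vmkload_mod -l | grep -i nvidia

# is the M60 visible on the PCI bus at all? (grep for nvidia rather than vga)
lspci | grep -i nvidia

# can the management tool reach the driver?
nvidia-smi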
The driver installs fine, and I've done the necessary steps of entering maintenance mode, installing, rebooting, and exiting maintenance mode, now multiple times to no avail. I have an R730 that I can try next, but I want to validate that it is worth the effort first.
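For completeness, this is roughly the sequence I've been following; the datastore path and vib filename below are placeholders for where I copied the 352.54 package:

# enter maintenance mode
esxcli system maintenanceMode set --enable true

# install the driver vib (substitute the actual path to the 352.54 package)
esxcli software vib install -v /vmfs/volumes/datastore1/NVIDIA-driver.vib

# reboot, then exit maintenance mode once the host is back up
reboot
esxcli system maintenanceMode set --enable false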
I'm hoping it's a misconfiguration, or that I've missed a cable connection in the chassis, as I really would like to get this R720 to work. Please let me know.
You can filter the list of certified servers by card type.
The R720 is not certified by Dell; the R730 is, as long as you have the relevant PSUs, power cables, etc. It's best to check with Dell on the specific requirements to retrofit the card.
Thank you for that. That is extremely helpful.
Now for the next part: is there a difference in supported R730s, i.e. an R730xd versus an R730? You have listed an R730, and I can get my hands on one of those in the future, but I have an R730xd now, which appears to be exhibiting the same behavior. Would some form of logs be useful?
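If logs would help, I can pull them. My assumption is that the most relevant entries would be the driver messages in the vmkernel log, along the lines of:

# look for NVIDIA / NVRM messages from the driver load attempt
grep -iE "nvrm|nvidia" /var/log/vmkernel.log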
Most servers don't ship with the required PSUs, PCIe risers, cables, etc., and some may require modified heatsinks or airflow baffles. Each OEM has a different set of additional components which may be required for a retrofit. In some cases it's just a power cable; in others it's a complete set of PSUs, risers, baffles, heatsinks and cables, so what you need to acquire depends on what you already have.
The OEM (in your case Dell) are the best people to ask for the details of what you require.
I know Dell chose not to certify the R720xd, whereas they did certify the R720 - the R720xd has some extra room for storage, which makes everything else a bit more cramped and affected the thermal cooling, if I recall correctly.
BUT as you are dealing with a possibly unsupported server, I think, as Jason suggests, you really need to talk to Dell and your hypervisor vendor, as you could be left unsupported even if it works.
I was able to install the gpumodeswitch vib on my ESXi 6.0 U2 host and successfully change the mode over to graphics. Oddly, gpumodeswitch is able to see the cards. After the mode was changed, I installed the software and rebooted the system, but I get nothing when I run "vmkload_mod -l | grep nvidia".
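My next step, unless someone suggests otherwise, is to check whether the module is even registered and to try loading it by hand. A rough sketch, assuming the vGPU Manager module is named "nvidia":

# is the nvidia module registered, and is it marked as loaded/enabled?
esxcli system module list | grep -i nvidia

# try loading it by hand, then check the tail of the vmkernel log for the failure reason
esxcli system module load -m nvidia
grep -i nvrm /var/log/vmkernel.log | tail -20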
I have two questions. I was able to install the GRID 3.0 VIB into ESXi 6.0 U2 (NVIDIA-vGPU-VMware_ESXi_6.0_Host_Driver_361.40-1OEM.600.0.0.2494585.vib) with no issue, and everything came up properly. However, after the installation, the guide said that I should use gpumodeswitch to switch modes.
Interestingly, the instructions in the gpumodeswitch doc said to remove any NVIDIA drivers - which was a bit weird. But I did that.
I tried to install the modeswitch vib (NVIDIA-GpuModeSwitch-1OEM.600.0.0.2494585.x86_64.vib) and it gave me an InstallationError saying the vib does not contain a signature. I lowered the acceptance level to CommunitySupported, but still had no luck installing it.
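For reference, this is roughly what I ran; the datastore path is just where I copied the vib:

# check and lower the host acceptance level
esxcli software acceptance get
esxcli software acceptance set --level=CommunitySupported

# attempt the install
esxcli software vib install -v /vmfs/volumes/datastore1/NVIDIA-GpuModeSwitch-1OEM.600.0.0.2494585.x86_64.vib
# (I gather adding --no-sig-check to the install command is another way to bypass the signature check, but I haven't confirmed that)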
There are several versions of the modeswitch utility and the .vib version does require the removal of the vGPU Manager.
I personally would not recommend using the .vib version unless you are unable to use the bootable ISO tool via the host's remote management software. The reason is that the ISO is a much simpler tool to work with and only requires 2 reboots: one to start the ISO and one to switch back to the hypervisor.
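If you do go the .vib route, a rough sketch of the removal step is below; take the exact vib name from the list output rather than assuming it (the name shown is only illustrative):

# find the exact name of the installed vGPU Manager vib
esxcli software vib list | grep -i nvidia

# remove it by name before installing the modeswitch vib (illustrative name)
esxcli software vib remove -n NVIDIA-vGPU-VMware_ESXi_6.0_Host_Driver

# reboot, install the modeswitch vib, switch the mode, then reinstall the vGPU Manager afterwards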