Nvidia-smi "No devices were found" - VMWare ESXI Ubuntu Server 20.04.03 with RTX3070

I have an Ubuntu Server 20.04.3 LTS (kernel 5.4.0) VM with an RTX 3070 passed through via ESXi. It had been running stably for months on the 470 drivers until a few days ago, when the driver apparently stopped recognizing the card (nvidia-smi reports “No devices were found”).

What I have tried:

  • Spun up a Windows 10 VM, passed through the GPU, installed the driver, and used it without issues, so it does not appear to be a hardware fault.
  • Tested another GPU in the system with the same results.
  • Spun up 2 more Ubuntu Server VMs (one bios install and one EFI install) with the same results (No devices were found)
  • Purged and reinstalled the driver + CUDA many times, trying 450, 470, 470-server, and now 510, with the same results each time (the rough purge/reinstall commands are sketched after this list).
  • Updated the motherboard BIOS to the latest v2.3 (Supermicro H12SSL-CT)
  • Reinstalled ESXi 7.0 on the host.
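
The purge/reinstall cycle I used each time was roughly the following (a sketch; exact package names vary with the driver version and with how CUDA was originally installed):

sudo apt purge 'nvidia-*' 'libnvidia-*' 'cuda-*'   # remove all driver and CUDA packages
sudo apt autoremove
sudo apt update
sudo apt install nvidia-driver-510                 # or 450 / 470 / 470-server
sudo reboot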

Would appreciate any help to solve this! I have attached my bug report as well.

I have seen some mentions that this dmesg output could mean the GPU has failed, but it works fine with Windows VMs and in other machines.

nvidia-smi

No devices were found

sudo lspci |grep -i nv

03:00.0 VGA compatible controller: NVIDIA Corporation Device 2484 (rev a1)
03:00.1 Audio device: NVIDIA Corporation Device 228b (rev a1)

dmesg (After running nvidia-smi)

[ 1606.332778] NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x26:0x56:1463)
[ 1606.332912] NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 0
[ 1607.004207] NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x26:0x56:1463)
[ 1607.004349] NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 0

cat /etc/modprobe.d/blacklist-nvidia-nouveau.conf

blacklist nouveau
options nouveau modeset=0

cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module  510.47.03  Mon Jan 24 22:58:54 UTC 2022
GCC version:  gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)

nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Fri_Dec_17_18:16:03_PST_2021
Cuda compilation tools, release 11.6, V11.6.55
Build cuda_11.6.r11.6/compiler.30794723_0

nvidia-bug-report.log.gz (664.4 KB)

Hi,
I have tried ESXi 6.5 U3 with a CentOS 7.9 VM and hit the same issue. I am also using an RTX 3070. Does it work fine with a Windows VM for you?

Yes, no issues on Windows 10. I get “This device is working properly.” in Device Manager and can run nvidia-smi in PowerShell without issue. It only seems to affect the GPU when it is passed through to Linux.

nvidia-smi from ESXi Windows 10 VM

So sad, it still doesn’t work on Windows 10.

I only ran nvidia-smi after installing the driver. If you installed the driver and everything and it still fails on Windows, then your issue might be a bit different…potentially hardware related?

Hey user167096, I’m having the exact same issue and it is driving me crazy. I have ESXi-7.0b running Ubuntu 18.04 (kernel 5.4.0-99) with an RTX 3060 that had been working great for the last few months, until last weekend when the system suddenly stopped recognizing the card. There were no hardware or software updates, no reboots, nothing out of the ordinary. The card just stopped being recognized.

I’ve gone through all the same troubleshooting you have above (different drivers, kernels, major Ubuntu releases, fresh installs, etc.) and get the same “No devices were found” message when running nvidia-smi. The card similarly shows up in the lspci list, oddly with a device ID instead of a model name, and gives the same RmInitAdapter failed messages in dmesg.
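
(Side note: as far as I know, lspci printing “Device 2484” instead of a model name is just an out-of-date PCI ID database in the guest rather than a symptom; refreshing it should make the name show up:)

sudo update-pciids
lspci | grep -i nvidia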

The card is recognized and runs fine in my Windows 10 VMs so it’s not a hardware issue.

It can’t be a coincidence that both of our machines experienced the same “vanishing GPU” issue at the same time. I’m also running a Supermicro motherboard, though the model number slips my mind. The GPU is an MSI RTX 3060. Also, while the product description doesn’t specifically call it out, this may be an LHR model. I wonder, is your GPU an MSI and/or an LHR model as well?

It feels like some sort of Y2K-style time bug for this failure to hit both machines at the same moment.

As frustrating as this issue is, it is good to hear I’m not alone with it. Please continue to update with anything you find. I agree it does not make sense for this to happen spontaneously to two people at the same time with similar configurations. I likewise had a fully stable system and woke up to find the card no longer recognized.

For reference, I have an NVIDIA RTX 3070 FE, so it is not an LHR issue. I have also swapped in a 3060 Ti FE and had the same problem. For that reason, I don’t think it is a GPU issue, unless all of a sudden it is being blocked for being a consumer card.

If you could get your Supermicro motherboard model number, I would appreciate it. I have an email out to Supermicro tech support but have not heard back yet.

It’s honestly baffling and I’m sure I’ve gotten a few grey hairs from just the last week or so of trying to fix it. My motherboard model is X9SRL-F-ME008.

I also tried updating the BIOS/firmware for my card, and all of the update software said it was up to date, so it doesn’t look like that is the issue.

I did try it with an older GeForce GT 710 card I had and, unfortunately, that card was recognized as expected, so it points to something with my 3060, or maybe the RTX 30 series in general, in our setup.

I’ve initiated an RMA for now to try to get a replacement, or at least see if there’s something MSI can do with it. I was hoping it wouldn’t come to that, as it will take down my development environment for at least a month while the card is shipped off and “repaired” (hopefully they’ll just replace it). It bothers me that I still don’t know what the issue is and that the RMA may not resolve it, but fingers crossed it comes back working. If not, I’ll have to build a test system to see if it’s the other components or some combination of them.


Since the last post, a friend has offered up some parts for me to test a bare metal Linux environment to try to rule out ESXi/VMs before I send the card off for the RMA. It’ll probably be at least a day or two before he can get me the parts and I can set everything up. I’ll keep you posted on the results.


UPDATE: I had a spare SSD, so I just tested a bare metal install with Ubuntu Server 20.04.3 LTS and NVIDIA driver 510, and it installed without issue. I guess the next step is to test another hypervisor like Proxmox to see whether the problem is the hypervisor or perhaps the motherboard BIOS?

UPDATE (again): I installed Proxmox, spun up an Ubuntu Server VM, passed through the card, installed the drivers, and it works. It seems like the issue is either entirely with VMware ESXi or with the Supermicro motherboard/ESXi combination.
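
In case anyone wants to reproduce the Proxmox test, the passthrough setup on that side was roughly the standard procedure; something along these lines, where the VM ID 100, the PCI address, and the AMD IOMMU flag are just examples for my hardware:

# enable IOMMU on the kernel command line in /etc/default/grub, e.g. amd_iommu=on iommu=pt, then:
sudo update-grub
# load the VFIO modules at boot
echo -e "vfio\nvfio_iommu_type1\nvfio_pci" | sudo tee -a /etc/modules
# attach the GPU (all functions at 03:00) to the VM as a PCIe device, then reboot the host
sudo qm set 100 -hostpci0 0000:03:00,pcie=1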

Wow, you’ve been busy! Thank you for going through the trouble of testing both of those configurations. I guess I’ll start looking at what it will take to migrate my setup away from ESXi and cancel the RMA (probably jumped the gun on that anyway haha).

If it’s narrowed down to ESXi (or ESXi in combination with a Supermicro motherboard) being the culprit, I have to wonder just what changed in February to make these GPUs unusable.

I’ll still be curious to hear your results if you do a bare metal install. I’m not yet giving up on ESXi but we will see how I feel about that by the end of the weekend.

The same issue on 18.04 and ESXi 6.7.
The problem is the unattended Ubuntu auto-upgrades running in the background.
I restored a VM snapshot created a couple of months ago with the network cards disabled and simply switched off auto-upgrades in “/etc/apt/apt.conf.d/20auto-upgrades”. Then, after restarting the VM, I re-enabled the network. Without disabling the network, your VM will try to upgrade immediately and you will have no chance to disable the auto-upgrade configuration.
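
For anyone else hitting this, switching off the auto-upgrades should just mean setting the periodic options in that file to “0”; a minimal example of what the file looks like afterwards:

cat /etc/apt/apt.conf.d/20auto-upgrades

APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";
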
BR
Remi

I’m testing on two different ASUS motherboards (TUF GAMING B560M-PLUS and ROG STRIX Z490-A GAMING) with a 3070 and a 3090. It’s still the same issue.
I tested a GTX 700-series card with the B560M-PLUS on ESXi 6.7 and Ubuntu 20.04 a month ago, and it worked well. Maybe I should post to the VMware community.

I still have not been able to find the root cause, but I was able to get a Debian VM going with its nvidia-driver package. It runs driver v460.91.03, so not the latest and greatest, but it works.

FYI the ESXi advanced configuration key hypervisor.cpuid.v0 = "FALSE" was required for this to work.
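
If anyone needs it, that key can be added in the vSphere client under VM Options > Advanced > Configuration Parameters (Edit Configuration), or, with the VM powered off, as a single line in the VM’s .vmx file, as far as I can tell:

hypervisor.cpuid.v0 = "FALSE"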

I’ve got an update on my side. I just finished putting together the bare metal system and the GPU shows up just fine. I’m running Ubuntu 20.04 and driver 510.47.03 on this system as well, with an ASUS motherboard, and the nvidia-smi output is the same as what you got on the systems that work. PyTorch 1.4 is installed as well and is also able to see it as a CUDA device, so it looks like everything is running well on bare metal.
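
The PyTorch check was just the usual one-liner, something like:

python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"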

I’ll probably stick with this setup for now since the main VM I ran was this environment. It’ll lose out on some of the perks of being a server but will give me time to figure out what the next steps for the other hardware will be.


It works on Proxmox.


Good to know. So this does not appear to be motherboard related and instead looks like an ESXi/Ubuntu issue.

If you’re trying to stick with ESXi, perhaps give Debian a try as your guest OS. It worked for me, so I’ll probably stick with it.
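
For reference, on the Debian guest the driver came straight from the distro’s non-free repository, roughly like this (package names are for bullseye and from memory; double-check against the Debian wiki for your release):

# after adding "contrib non-free" to the bullseye entries in /etc/apt/sources.list
sudo apt update
sudo apt install nvidia-driver firmware-misc-nonfree
sudo reboot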

Hi everyone, I have an RTX 3080 on an ESXi 7.0U2 install. I previously had this setup working on bare metal with 20.04.3 and CUDA toolkit 11.6, with good results from deviceQuery. My 3080 would not show up using 20.04.3 and ESXi 7.0U3. I did have a GTX 1050 lying around and could get that to work.

Thank you for suggesting the nvidia-driver-460 package; I have got the 3080 working now. That is a massive help. I was about to go back to bare metal.