GPU not being recognized - No devices were found returned by nvidia-smi

Hi,

I’m having an issue with my GPU server. nvidia-smi is returning “No devices were found”.
When I use:
lspci -k | grep -A 2 -E "(VGA|3D)"

it returns

01:00.0 VGA compatible controller: NVIDIA Corporation GA106 [GeForce RTX 3060 Lite Hash Rate] (rev a1)
	Subsystem: ASUSTeK Computer Inc. GA106 [GeForce RTX 3060 Lite Hash Rate]
	Kernel driver in use: nvidia

I tried downgrading the driver version and installing it from a file instead of dpkg, and a lot of other things I could find, but nothing helps.

Please find the debug log attached.

Can someone please provide guidance on how to fix this issue?

nvidia-bug-report.log.gz (282.6 KB)

Dear Ivan,
From the logs, it appears that you are having issues with Asus RTX 3060 LHR cards.
Please check with Asus whether this is a known issue on their end and whether a newer VBIOS is available for these cards.
Meanwhile, I am also checking internally and with other teams for the same GPU so that I can try to reproduce the issue locally.
Also, could you please share a fresh bug report captured immediately after triggering the issue? (I suggest rebooting the system once before triggering it, so that the report contains only relevant logs.)

Hi @amrits,

Yes, we have issues with Asus RTX 3060 LHR cards on multiple devices. Do you have a recommendation on how to reach out to Asus?

“Also, can you please share fresh bug report just immediately after triggering issue (Suggest you reboot system once before trying to trigger issue so that report has only relevant logs)”

If you take a closer look, you’ll see that the issue occurs immediately after booting the device, so we can’t get a better log than the one we attached.
This is the relevant part of the log, where you can see that the GPU is not being registered at boot.

Jan 19 08:29:08 forsight-desktop kernel: [    5.008866] nvidia-nvlink: Nvlink Core is being initialized, major device number 509
Jan 19 08:29:08 forsight-desktop kernel: [    5.058634] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.60.11  Wed Nov 23 23:04:03 UTC 2022
Jan 19 08:29:08 forsight-desktop kernel: [    5.100330] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  525.60.11  Wed Nov 23 22:49:17 UTC 2022
Jan 19 08:29:08 forsight-desktop kernel: [    5.105242] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Jan 19 08:29:08 forsight-desktop kernel: [    5.376815] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x72:1427)
Jan 19 08:29:08 forsight-desktop kernel: [    5.377047] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Jan 19 08:29:08 forsight-desktop kernel: [    5.377092] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
Jan 19 08:29:08 forsight-desktop kernel: [    5.377163] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device
Jan 19 08:29:08 forsight-desktop kernel: [    5.441738] nvidia-uvm: Loaded the UVM driver, major device number 507.
Jan 19 08:29:15 forsight-desktop kernel: [   12.234685] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x72:1427)
Jan 19 08:29:15 forsight-desktop kernel: [   12.234707] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Jan 19 08:29:15 forsight-desktop kernel: [   12.302359] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x72:1427)
Jan 19 08:29:15 forsight-desktop kernel: [   12.302378] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Jan 19 08:29:15 forsight-desktop kernel: [   12.370612] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x72:1427)
Jan 19 08:29:15 forsight-desktop kernel: [   12.370629] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Jan 19 08:29:15 forsight-desktop kernel: [   12.437986] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x72:1427)
Jan 19 08:29:15 forsight-desktop kernel: [   12.438007] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Jan 19 08:29:15 forsight-desktop kernel: [   12.506141] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x72:1427)
Jan 19 08:29:15 forsight-desktop kernel: [   12.506162] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Jan 19 08:29:15 forsight-desktop kernel: [   12.576142] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x72:1427)

Looking forward to your suggestions.
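For anyone triaging a similar log, here is a minimal Python sketch that counts the `RmInitAdapter failed!` lines per GPU and error code. The regular expression is written only against the line format shown in the excerpt above, and the function name `count_rm_init_failures` is made up for this example; it is not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x72:1427)
# The pattern is based only on the log excerpt in this thread (assumption).
RM_INIT_FAILED = re.compile(
    r"NVRM: GPU (?P<bdf>[0-9a-fA-F:.]+): RmInitAdapter failed! "
    r"\((?P<code>0x[0-9a-fA-F]+:0x[0-9a-fA-F]+:\d+)\)"
)


def count_rm_init_failures(log_text):
    """Return a Counter keyed by (PCI address, error code)."""
    failures = Counter()
    for line in log_text.splitlines():
        match = RM_INIT_FAILED.search(line)
        if match:
            failures[(match["bdf"], match["code"])] += 1
    return failures
```

Feeding it the saved kernel log text makes it easy to see whether the same code repeats on every attempt or whether different codes show up across boots.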

The underlying issue is that no initial framebuffer is handed over from the BIOS. If you connect a monitor to the NVIDIA GPU, are the POST messages displayed? Does the driver work in that condition?
This should be fixable by a VBIOS update from Asus. Contacting them might be in vain, as they deny any support when running Linux. Rather, check their support site, e.g.
https://www.asus.com/us/motherboards-components/graphics-cards/dual/dual-rtx3060-o12g-v2/helpdesk_bios/?model2Name=DUAL-RTX3060-O12G-V2
(I don’t know your specific model, so you’ll have to check.)
If that doesn’t help, you might try some system BIOS tweaks, e.g. setting the primary VGA adapter explicitly, disabling fast boot, or enabling CSM.

Hi @generix, thank you for your reply.
Unfortunately, we don’t have physical access to the device since it’s deployed at our customer location.
We would like to avoid a site visit in order to fix this issue.

I checked the ASUS link, but it’s for Windows.

I tried using nvflash to install a newer VBIOS. The GPU showed up for a moment, and then it stopped working again with the same error.

This is the exact device that we have installed:

NVIDIA Firmware Update Utility (Version 5.792.0)
Copyright (C) 1993-2022, NVIDIA Corporation. All rights reserved.

NVIDIA display adapters present in the system:
<0> Graphics Device (10DE,2504,1043,8810) S:00,B:01,D:00,F:00

Do you have any other ideas which we could try without physical access or a site visit?

Thank you!

So the correct bios update would be this:
https://www.asus.com/us/motherboards-components/graphics-cards/phoenix/ph-rtx3060-12g-v2/helpdesk_bios/?model2Name=PH-RTX3060-12G-V2
You could try extracting it on a Windows machine; the VBIOS files are inside it. Then you could try flashing those and rebooting.
Other than that, every other method would require physical access to the machine since it’s a low-level init issue.

@generix Thank you. I unpacked the Windows exe file and found the VBIOS files and tried flashing them.
The flash succeeds without any errors, but the problem is still there. I tried rebooting and reinstalling different driver versions, but nothing did the trick.

I assume that we’ll have to perform the BIOS changes :S

What’s weird, though, is that this device had been working without any issues, and after an update the GPU wasn’t visible anymore. I’m puzzled how this can have something to do with BIOS settings.

That’s an important piece of information that was missing before.
Taking this into account, it might be an issue with GRUB, defective hardware, or something else. I guess without attaching a monitor you won’t find out what’s really going on.

How can I debug issues with GRUB remotely? Do you have any proposal?
Thank you!

I’d rather say “not at all”, since everything that happens before the kernel loads is unknown. You can of course do some indirect checks: check the installed version, check the apt history to see whether it got updated when the issues started, and check the GRUB config for odd entries.
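The apt history check can be done remotely. Below is a rough Python sketch that scans the text of `/var/log/apt/history.log` (the usual location on Debian/Ubuntu) for transactions touching packages of interest. The watch list `WATCHED` and the function name are assumptions chosen for this example, so adjust them to the packages you actually care about.

```python
import re

# Package-name fragments to look for; this list is an assumption for
# illustration, not something taken from the bug report.
WATCHED = ("grub", "nvidia", "linux-image")


def find_watched_transactions(history_text, watched=WATCHED):
    """Return a list of (start_date, action, package) for watched packages."""
    results = []
    # Transactions in history.log are separated by blank lines.
    for block in history_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(": ")
            if sep:
                fields[key] = value
        start = fields.get("Start-Date", "unknown")
        for action in ("Install", "Upgrade", "Remove", "Purge"):
            # Entries look like "pkg:arch (old, new), pkg2:arch (version)".
            entries = re.findall(r"([^\s,()]+) \([^)]*\)", fields.get(action, ""))
            for entry in entries:
                name = entry.split(":")[0]
                if any(fragment in name for fragment in watched):
                    results.append((start, action, name))
    return results
```

Comparing the returned dates with the first boot that shows `RmInitAdapter failed!` indicates whether a GRUB, kernel, or driver package changed around the time the problem started. Older, rotated history files would need to be decompressed and passed in separately.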

@generix Okay, we’ve got the device back from the field and connected the device to a monitor.

The device boots normally and the GPU shows in nvidia-smi.

What should we do to prevent this behavior and have reliable behavior of the GPU even when the monitor is not plugged in?

Kind of odd that this happened spontaneously. Please check the “primary video” setting in the BIOS. Setting it to auto makes it depend on a monitor being connected.

@generix Thank you for your reply.
Would a dummy HDMI plug fix this issue?

If you can reliably reproduce the issue (start with monitor connected - nvidia works, start with monitor disconnected - nvidia doesn’t work) this might be a valid workaround.
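To make that reproduction test repeatable, one option is a small check run after each boot (once with the monitor or dummy plug attached, once without) that records whether the driver sees the GPU. The sketch below is only an illustration: it relies on `nvidia-smi -L`, which prints one line starting with `GPU ` per detected device, and the helper names `gpu_visible` and `check_gpu` are made up for this example.

```python
import subprocess


def gpu_visible(smi_output, returncode):
    """True if `nvidia-smi -L` succeeded and listed at least one GPU."""
    if returncode != 0:
        return False
    return any(line.startswith("GPU ") for line in smi_output.splitlines())


def check_gpu():
    """Run `nvidia-smi -L` and report whether any GPU was listed."""
    try:
        proc = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # Driver tools missing or hung: treat as "GPU not usable".
        return False
    return gpu_visible(proc.stdout, proc.returncode)
```

Logging the result of `check_gpu()` together with the boot condition gives a simple record of whether the failure really tracks the presence of a display.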

Hi @generix, we’re observing this issue again. We recalled a unit from our customer and sent a replacement unit to the same site. At our HQ, everything worked as expected during testing: the GPU showed up without any issues, and our software worked as well.

Now we have the same issue after installing the device at the customer’s location:

[ 889.927494] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:667)
[ 889.927509] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

Could this issue be related to power issues with the outlet?

Thank you!

This is now a completely different issue, which would point to the GPU either being installed in an unsupported VM or being broken.
Both errors you ran into are extremely unlikely on a bare-metal install, so unstable/unclean power might be a valid reason.