Unable to detect second GPU on Ubuntu 16.04/18.04

Hello,

I have been building a custom setup for Deep Learning and my specs are as follows:

CPU: i7 8700k
GPU: 2 × RTX 2070
Motherboard: Aorus Ultra Z390
PSU: 750 W
RAM: 2 × 16 GB

I installed the latest NVIDIA drivers along with CUDA 9.0, but running nvidia-smi shows only one GPU. I swapped the GPUs and still only one is detected, so there is no issue with the GPUs themselves. The undetected GPU is in the PCIe x8 slot; it lights up and its fans spin. The third PCIe slot is x4, so it is not possible for me to test the GPU in that one. I tried Ubuntu 16.04 and 18.04 with Linux kernel versions 4.20, 4.15 and 4.9, but none of them detect it. I do not have an SLI bridge connected, but I assume that shouldn’t prevent the card from being detected. Could anyone please help me out?

What is the result of:

sudo lspci |grep -i nv

?

If that shows 1 GPU, then you have a hardware/motherboard issue, and no one can fix that but you.

If it shows 2 GPUs, then report what the output of the following is:

sudo dmesg |grep NVRM
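As a side note, if lspci only reports one device, a slightly more verbose check can show whether the second card is enumerating on the bus at all. This is only a sketch (10de is the NVIDIA PCI vendor ID, and the output details vary by kernel and lspci version):

    # List every NVIDIA device (PCI vendor ID 10de) with full details,
    # including BAR assignments and PCIe link status
    sudo lspci -vv -d 10de: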

It showed only one GPU. Does that mean there is definitely a problem with the motherboard? Is there a motherboard recommended by NVIDIA for dual-GPU systems? This is the second motherboard I have tried that has some kind of fault. Thanks.

You might want to try enabling “Above 4G Decoding” in the BIOS:

http://download.gigabyte.us/FileList/Manual/mb_manual_z390-aorus-ultra_1001_181120_e.pdf

(However, I doubt that is the problem here.)

Otherwise, I don’t have any suggestions. The motherboard appears to support multiple GPUs, and it appears to be designed to automatically detect and configure when multiple GPUs are installed.

According to my read of the manual, it does appear that the x4 slot is also capable of hosting a GPU.

You should make sure that both GPUs have proper power connections.
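As a follow-up to the “Above 4G Decoding” suggestion: one hedged way to check whether PCI resource allocation is the limiting factor is to look for BAR assignment failures in the kernel log. A rough sketch (the exact message wording varies by kernel version):

    # Look for PCI BAR/resource allocation failures reported at boot
    sudo dmesg | grep -iE "BAR .*(no space|failed to assign)"

If nothing shows up there, the missing GPU is probably not a 4G-decoding/BAR issue.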

Hi,

I’m having the same issue on quite a similar build:
i9 9900K,
Z390 (ASUS)
2x GTX 1080 Ti
Driver version 418.56, Ubuntu 16.04

$ sudo lspci |grep -i nv

01:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
02:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)

$ sudo dmesg |grep NVRM

[    1.430743] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  418.56  Fri Mar 15 12:59:26 CDT 2019
[   27.245009] NVRM: RmInitAdapter failed! (0x31:0xffff:834)
[   27.245068] NVRM: rm_init_adapter failed for device bearing minor number 1
[   33.438488] NVRM: RmInitAdapter failed! (0x31:0xffff:834)
[   33.438502] NVRM: rm_init_adapter failed for device bearing minor number 1

I tried turning on “Above 4G Decoding” in the BIOS and the kernel did not boot (blank screen after GRUB). The kernel boots again after it is turned off.

What can be done?

Hello,

Did you solve it? I am also encountering the same error.

The issue I encountered turned out to be a GPU hardware issue:

https://devtalk.nvidia.com/default/topic/1055438/detecting-1-of-2-gpus-nvrm-rminitadapter-failed

Hello! We faced this issue and found a solution! We have an MSI Z270 A-PRO motherboard with two M.2 slots, two GTX 1060 cards, and an NVMe M.2 SSD. The problem was that the SSD and the GPU were sharing the same PCIe lanes, so we simply moved the SSD to the other slot.
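For anyone debugging a similar lane-sharing situation, the sketch below is one way to see how the GPUs and the NVMe drive are attached to the CPU/chipset; the exact slot-to-lane mapping still has to be checked against the motherboard manual, and the bus ID 01:00.0 should be adjusted to match your own lspci output:

    # Tree view of the PCI topology: shows which root port each GPU
    # and NVMe device hangs off
    sudo lspci -tv

    # Negotiated link width for one GPU; LnkSta shows the width actually in use
    sudo lspci -vv -s 01:00.0 | grep -i "LnkCap\|LnkSta"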

Hello,

I have a similar issue.
Both of my GPUs were detected and everything was working fine until the system crashed suddenly. Upon checking, I found that the second GPU is no longer detected.
In addition, I do not get the login screen display.
The following fixes have worked for me before when the display failed:

  1. Updating the NVIDIA driver to version 430 seemed to fail; using the previous version, 418, worked for me before.
  2. With driver version 430, I did the following to get the display up and running:
    cd /usr/lib/nvidia-430
    sudo rm libGL.so.1
    sudo ln -s libGL.so.1.7.0 libGL.so.1
    then reboot

Neither of these worked for me this time, and the second GPU still fails to be detected.
Following this post, I tried “lspci | grep -i nv” and both GPUs are visible.
I then tried sudo dmesg |grep NVRM and I get the following messages:

[ 1.358643] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 430.26 Tue Jun 4 17:40:52 CDT 2019
[ 2.257676] NVRM: failed to register with the ACPI subsystem!
[ 93.388633] NVRM: GPU at PCI:0000:01:00: GPU-c12d2e72-53e2-28cf-e177-40631190e05c
[ 93.388635] NVRM: GPU Board Serial Number:
[ 93.388635] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 93.388636] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[ 93.388636] NVRM: GPU 0000:01:00.0: GPU is on Board .
[ 93.436826] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[ 93.603879] NVRM: failed to unregister from the ACPI subsystem!
[ 93.847440] NVRM: failed to register with the ACPI subsystem!
[ 94.000936] NVRM: failed to unregister from the ACPI subsystem!

Not sure what needs to be done now.

Looking forward to hearing from someone soon.

I really hope the driver issues get fixed, as I am constantly running into problems with them.

Thanks,
T

Reboot the system, then run

sudo dmesg |grep NVRM

again

Hello Robert,

After reboot:

[ 1.342888] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 430.26 Tue Jun 4 17:40:52 CDT 2019
[ 2.189248] NVRM: failed to register with the ACPI subsystem!
[ 93.324635] NVRM: GPU at PCI:0000:01:00: GPU-c12d2e72-53e2-28cf-e177-40631190e05c
[ 93.324636] NVRM: GPU Board Serial Number:
[ 93.324637] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 93.324638] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[ 93.324638] NVRM: GPU 0000:01:00.0: GPU is on Board .
[ 93.372824] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[ 93.539754] NVRM: failed to unregister from the ACPI subsystem!
[ 93.915523] NVRM: failed to register with the ACPI subsystem!
[ 94.068858] NVRM: failed to unregister from the ACPI subsystem!

Thank you!

Just to add to this, the nvidia-smi query is not working either. The system hangs indefinitely without showing any results.

Your GPU looks broken to me. Overheating may be a possibility, since it seems to fall off the bus after ~90 seconds of on-time.
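If it is possible to capture more data before the hang, the kernel log above itself asks for a bug report; the collection script ships with the driver package:

    # Collects dmesg, lspci and driver state into nvidia-bug-report.log.gz
    # in the current directory (run as root, ideally right after the failure)
    sudo nvidia-bug-report.sh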

There was an overheating issue, but I thought I had fixed it since the temperatures came down. What is the possible fix? Should I remove the GPU and reinstall it, or is the whole card broken?

What sort of GPUs do you have? What are the GPU names and model numbers?

These are two GeForce RTX 2080 Ti cards (11 GB, 352-bit GDDR6).

Typical hardware troubleshooting methods apply. Swap system components to see whether the problem follows the GPU when it is moved to a new system, or stays with the old system.

It seems likely to me that the GPU is damaged, but it’s impossible to rule out other system factors such as an improperly installed GPU, lack of sufficient cooling, lack of sufficient power, improperly connected aux power, etc.

My suggestion would be to start over with a fresh system and install/set up only the misbehaving GPU. If the problem repeats itself, the GPU is likely damaged.

The fact that the GPU works correctly for ~90 seconds and then falls off the bus suggests an overheating issue. If the GPU has previously overheated, it may have damaged the thermal connection between the GPU die/package itself and the GPU heatsink/fansink. If that is the case, it is not an end-user-repairable situation and is not covered by warranty. End-user repairs attempting to address these items, such as disassembly of the card, also void the warranty, as far as I know.

If the system is usable briefly after start-up, you may be able to confirm the GPU overheating theory by carefully monitoring with nvidia-smi before the ~90 seconds have elapsed and the GPU has fallen off the bus.
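For example, something along these lines, started right after boot, would log temperatures and power draw once per second until the GPU drops off the bus. This is only a minimal sketch; the field names follow nvidia-smi --help-query-gpu, and the log file name is arbitrary:

    # Log timestamp, temperature, power draw and SM clock for each GPU every second
    nvidia-smi --query-gpu=timestamp,index,name,temperature.gpu,power.draw,clocks.sm \
        --format=csv -l 1 | tee gpu_monitor.log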

I will try separating the GPUs into two different machines and installing everything from scratch.
I hope it works out and the GPU is not damaged.

I will get back based on what I see.

Thank you very much for following up.

Before doing anything else, you might want to test the overheating theory by checking whether you can “see” the problematic GPU with nvidia-smi during the first ~90 seconds of system operation.

The problem is that nvidia-smi freezes and doesn’t give me any information. Earlier I could see only one GPU in nvidia-smi, but now the console freezes after the command and there is no display on my screen either.