Unable to detect second GPU on Ubuntu 16.04/18.04

Hello,

I have been building a custom setup for Deep Learning and my specs are as follows:

CPU: i7 8700k
GPU: 2 × RTX 2070
Motherboard: Aorus Ultra Z390
PSU: 750 W
RAM: 2 × 16 GB

I installed the latest NVIDIA drivers along with CUDA 9.0, but running nvidia-smi shows only one GPU. I swapped the GPUs and still only one is detected, so there is no issue with the GPUs themselves. The undetected GPU is in the PCIe x8 slot; it lights up and its fans spin. The third PCIe slot is x4, so it is not possible for me to test the GPU in that one. I tried Ubuntu 16.04 and 18.04 with Linux kernel versions 4.20, 4.15 and 4.9, but none of them detect it. I do not have an SLI bridge connected, but I assume that shouldn’t prevent the card from being detected. Could anyone please help me out?

What is the result of:

sudo lspci |grep -i nv

?

If that shows 1 GPU, then you have a hardware/motherboard issue, and no one can fix that but you.

If it shows 2 GPUs, then report what the output of the following is:

sudo dmesg |grep NVRM
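As a side note, if lspci only reports one device, a slightly more verbose check can show whether the second card is enumerating on the bus at all. This is only a sketch (10de is the NVIDIA PCI vendor ID, and the output details vary by kernel and lspci version):

    # List every NVIDIA device (PCI vendor ID 10de) with full details,
    # including BAR assignments and PCIe link status
    sudo lspci -vv -d 10de: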

It showed only one GPU. Does that mean there is definitely a problem with the motherboard? Is there a motherboard recommended by NVIDIA for dual-GPU systems? This is the second motherboard I have tried that has some kind of fault. Thanks.

You might want to try enabling “Above 4G Decoding” in the BIOS:

http://download.gigabyte.us/FileList/Manual/mb_manual_z390-aorus-ultra_1001_181120_e.pdf

(However, I doubt that is the problem here.)

Otherwise, I don’t have any suggestions. The motherboard appears to support multiple GPUs, and it appears to be designed to automatically detect and configure when multiple GPUs are installed.

According to my read of the manual, it does appear that the x4 slot is also capable of hosting a GPU.

You should make sure that both GPUs have proper power connections.
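As a follow-up to the “Above 4G Decoding” suggestion: one hedged way to check whether PCI resource allocation is the limiting factor is to look for BAR assignment failures in the kernel log. A rough sketch (the exact message wording varies by kernel version):

    # Look for PCI BAR/resource allocation failures reported at boot
    sudo dmesg | grep -iE "BAR .*(no space|failed to assign)"

If nothing shows up there, the missing GPU is probably not a 4G-decoding/BAR issue.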

Hi,

I’m having the same issue on quite a similar build:
i9 9900K,
Z390 (ASUS)
2x GTX 1080 Ti
Driver version 418.56, Ubuntu 16.04

$ sudo lspci |grep -i nv

01:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
02:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)

$ sudo dmesg |grep NVRM

[    1.430743] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  418.56  Fri Mar 15 12:59:26 CDT 2019
[   27.245009] NVRM: RmInitAdapter failed! (0x31:0xffff:834)
[   27.245068] NVRM: rm_init_adapter failed for device bearing minor number 1
[   33.438488] NVRM: RmInitAdapter failed! (0x31:0xffff:834)
[   33.438502] NVRM: rm_init_adapter failed for device bearing minor number 1

I tried turning on “Above 4G Decoding” in the BIOS and the kernel did not boot (blank screen after GRUB). The kernel boots again after it is turned off.

What can be done?

Hello,

Did you solve it? I am also encountering the same error.

The issue I encountered turned out to be a GPU hardware issue:

https://devtalk.nvidia.com/default/topic/1055438/detecting-1-of-2-gpus-nvrm-rminitadapter-failed

Hello! We faced this issue and found a solution! We have an MSI Z270 A-PRO motherboard with two M.2 slots, two GTX 1060 cards, and an NVMe M.2 SSD. The problem was that the SSD and the GPU were sharing the same PCIe lanes, so we simply moved the SSD to the other slot.
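For anyone debugging a similar lane-sharing situation, the sketch below is one way to see how the GPUs and the NVMe drive are attached to the CPU/chipset; the exact slot-to-lane mapping still has to be checked against the motherboard manual, and the bus ID 01:00.0 should be adjusted to match your own lspci output:

    # Tree view of the PCI topology: shows which root port each GPU
    # and NVMe device hangs off
    sudo lspci -tv

    # Negotiated link width for one GPU; LnkSta shows the width actually in use
    sudo lspci -vv -s 01:00.0 | grep -i "LnkCap\|LnkSta"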

Hello,

I have a similar issue.
Both of my GPUs were detected and everything was working fine until the system crashed suddenly. Upon checking, I found that the second GPU is no longer detected.
In addition, I do not get the login screen display.
The following fixes have worked for me before when the display failed:

  1. Updating the NVIDIA driver to version 430 seemed to fail; using the previous version, 418, worked for me before.
  2. With driver version 430, I did the following to get the display up and running:
    cd /usr/lib/nvidia-430
    sudo rm libGL.so.1
    sudo ln -s libGL.so.1.7.0 libGL.so.1
    then reboot

Neither of these worked for me this time, and the second GPU still fails to be detected.
Following this post, I tried “lspci | grep -i nv” and both GPUs are visible.
I then tried sudo dmesg |grep NVRM and I get the following messages:

[ 1.358643] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 430.26 Tue Jun 4 17:40:52 CDT 2019
[ 2.257676] NVRM: failed to register with the ACPI subsystem!
[ 93.388633] NVRM: GPU at PCI:0000:01:00: GPU-c12d2e72-53e2-28cf-e177-40631190e05c
[ 93.388635] NVRM: GPU Board Serial Number:
[ 93.388635] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 93.388636] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[ 93.388636] NVRM: GPU 0000:01:00.0: GPU is on Board .
[ 93.436826] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[ 93.603879] NVRM: failed to unregister from the ACPI subsystem!
[ 93.847440] NVRM: failed to register with the ACPI subsystem!
[ 94.000936] NVRM: failed to unregister from the ACPI subsystem!

Not sure what needs to be done now.

Looking forward to hearing from someone soon.

I really hope the driver issues get fixed, as I am constantly running into problems with them.

Thanks,
T

Reboot the system, then run

sudo dmesg |grep NVRM

again

Hello Robert,

After reboot:

[ 1.342888] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 430.26 Tue Jun 4 17:40:52 CDT 2019
[ 2.189248] NVRM: failed to register with the ACPI subsystem!
[ 93.324635] NVRM: GPU at PCI:0000:01:00: GPU-c12d2e72-53e2-28cf-e177-40631190e05c
[ 93.324636] NVRM: GPU Board Serial Number:
[ 93.324637] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 93.324638] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[ 93.324638] NVRM: GPU 0000:01:00.0: GPU is on Board .
[ 93.372824] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[ 93.539754] NVRM: failed to unregister from the ACPI subsystem!
[ 93.915523] NVRM: failed to register with the ACPI subsystem!
[ 94.068858] NVRM: failed to unregister from the ACPI subsystem!

Thank you!

Just to add to this, the nvidia-smi query is not working either. The system hangs indefinitely without showing any results.

Your GPU looks broken to me. Overheating may be a possibility, since it seems to fall off the bus after ~90 seconds of on-time.
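If it is possible to capture more data before the hang, the kernel log above itself asks for a bug report; the collection script ships with the driver package:

    # Collects dmesg, lspci and driver state into nvidia-bug-report.log.gz
    # in the current directory (run as root, ideally right after the failure)
    sudo nvidia-bug-report.sh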

There was an overheating issue, but I thought I had fixed it since the temperatures came down. What is the possible fix? Should I remove the GPU and reinstall it, or is the whole card broken?

What sort of GPUs do you have? What are the GPU names and model numbers?

These are two GeForce RTX 2080 Ti cards (11 GB, 352-bit GDDR6).

Typical hardware troubleshooting methods apply. Swap system components to see whether the problem follows the GPU when it is moved to a new system, or stays with the old system.

It seems likely to me that the GPU is damaged, but it’s impossible to rule out other system factors such as an improperly installed GPU, lack of sufficient cooling, lack of sufficient power, improperly connected aux power, etc.

My suggestion would be to start over with a fresh system and install/set up only the misbehaving GPU. If the problem repeats itself, the GPU is likely damaged.

The fact that the GPU works correctly for ~90 seconds and then falls off the bus suggests an overheating issue. If the GPU has previously overheated, it may have damaged the thermal connection between the GPU die/package itself and the GPU heatsink/fansink. If that is the case, it is not an end-user-repairable situation and is not covered by warranty. End-user repairs attempting to address these items, such as disassembly of the card, also void the warranty, as far as I know.

If the system is usable briefly after start-up, you may be able to confirm the GPU overheating theory by carefully monitoring with nvidia-smi before the ~90 seconds have elapsed and the GPU has fallen off the bus.
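For example, something along these lines, started right after boot, would log temperatures and power draw once per second until the GPU drops off the bus. This is only a minimal sketch; the field names follow nvidia-smi --help-query-gpu, and the log file name is arbitrary:

    # Log timestamp, temperature, power draw and SM clock for each GPU every second
    nvidia-smi --query-gpu=timestamp,index,name,temperature.gpu,power.draw,clocks.sm \
        --format=csv -l 1 | tee gpu_monitor.log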

I will try separating the GPUs into two different machines and installing everything from scratch.
I hope it works out and the GPU is not damaged.

I will get back based on what I see.

Thank you very much for following up.

Before doing anything else, you might want to test the overheating theory by checking whether you can “see” the problematic GPU with nvidia-smi during the first ~90 seconds of system operation.

The problem is that nvidia-smi freezes and doesn’t give me any information. Earlier I could see only one GPU in nvidia-smi, but now the console freezes after the command and there is no display on my screen either.