Nvidia-smi failed to detect all GPU cards

kchatzitheodorou · December 13, 2018, 3:42pm

I have an NVIDIA GeForce GTX 1080 Ti (GIGABYTE) installed on an Ubuntu 18.04 machine, and now I am trying to install a second, similar card (ASUS).

nvidia-smi does not detect the second card and sometimes Ubuntu is not able to restart. Here is nvidia-smi output:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1203      G   /usr/lib/xorg/Xorg                           115MiB |
|    0      1403      G   /usr/bin/gnome-shell                          76MiB |
+-----------------------------------------------------------------------------+

and here is the lspci | grep -i vga output:
17:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
65:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)

I also tried with CUDA Version: 9.0 and NVIDIA-SMI 390, but I was getting out-of-memory errors when using it with TensorFlow. However, when I removed it, TensorFlow ran without issues.

Any ideas? Thanks!

Robert_Crovella · December 13, 2018, 3:46pm

what is the output of:

dmesg |grep NVRM

?

kchatzitheodorou · December 13, 2018, 4:02pm

Now I have only the GIGABYTE card plugged in.

The output of dmesg | grep NVRM is:

[ 1.476988] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 415.18 Thu Nov 15 22:01:24 CST 2018 (using threaded interrupts)

Thanks

Robert_Crovella · December 14, 2018, 2:51pm

I’m not able to say what is wrong with only one card plugged in. I was interested in the dmesg output when the system reports this:

kchatzitheodorou · December 14, 2018, 2:57pm

Sure, here you go

$ dmesg | grep NVRM
[ 1.621075] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 415.18 Thu Nov 15 22:01:24 CST 2018 (using threaded interrupts)
[ 122.068173] NVRM: GPU at PCI:0000:17:00: GPU-7fdb9608-2313-c9b6-ed03-ebb387f81d1d
[ 122.068179] NVRM: GPU Board Serial Number:
[ 122.068184] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 1): Out Of Range Address
[ 122.068198] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x50ce48=0x24000e 0x50ce50=0x20 0x50ce44=0xd3eff2 0x50ce4c=0x17f
[ 122.068275] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 1): Out Of Range Address
[ 122.068287] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x514e48=0x6000e 0x514e50=0x20 0x514e44=0xd3eff2 0x514e4c=0x17f
[ 122.068358] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 2): Out Of Range Address
[ 122.068369] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x515648=0x31000e 0x515650=0x20 0x515644=0xd3eff2 0x51564c=0x17f
[ 122.068444] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 3): Out Of Range Address
[ 122.068456] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x515e48=0x12d000e 0x515e50=0x20 0x515e44=0xd3eff2 0x515e4c=0x17f
[ 122.068527] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 4): Out Of Range Address
[ 122.068539] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x516648=0x17000e 0x516650=0x20 0x516644=0xd3eff2 0x51664c=0x17f
[ 122.068601] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 0): Out Of Range Address
[ 122.068613] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x51c648=0x31000e 0x51c650=0x20 0x51c644=0xd3eff2 0x51c64c=0x17f
[ 122.068684] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 1): Out Of Range Address
[ 122.068693] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x51ce48=0x109000e 0x51ce50=0x20 0x51ce44=0xd3eff2 0x51ce4c=0x17f
[ 122.068764] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 3): Out Of Range Address
[ 122.068775] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x51de48=0xb000e 0x51de50=0x20 0x51de44=0xd3eff2 0x51de4c=0x17f
[ 122.068834] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 0): Out Of Range Address
[ 122.068843] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x524648=0x138000e 0x524650=0x20 0x524644=0xd3eff2 0x52464c=0x17f
[ 122.068890] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 1): Out Of Range Address
[ 122.068898] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Global Exception on (GPC 4, TPC 1): Physical Multiple Warp Errors
[ 122.068905] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x524e48=0xa000e 0x524e50=0x24 0x524e44=0xd3eff2 0x524e4c=0x17f
[ 122.068952] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 3): Out Of Range Address
[ 122.068960] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x525e48=0x32000e 0x525e50=0x20 0x525e44=0xd3eff2 0x525e4c=0x17f
[ 122.069009] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 2): Out Of Range Address
[ 122.069017] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x52d648=0x13000e 0x52d650=0x20 0x52d644=0xd3eff2 0x52d64c=0x17f
[ 303.843967] NVRM: Xid (PCI:0000:17:00): 38, 000f 0000c197 00000000 00000000 00000000
[ 510.660315] NVRM: Xid (PCI:0000:17:00): 38, 000f 0000c197 00000000 00000000 00000000

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2587      C   /usr/bin/python                                8MiB |
|    1      1243      G   /usr/lib/xorg/Xorg                           114MiB |
|    1      1446      G   /usr/bin/gnome-shell                          92MiB |
+-----------------------------------------------------------------------------+

$ lspci | grep -i VGA
17:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
65:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)

Robert_Crovella · December 14, 2018, 3:32pm

I’m confused. This question title reads:

“Nvidia-smi failed to detect all GPU cards”

But your current output shows that both are being detected:

$ nvidia-smi
Fri Dec 14 14:54:50 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.18       Driver Version: 415.18       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:17:00.0 Off |                  N/A |
| 34%   68C    P0    95W / 250W |     20MiB / 11178MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:65:00.0  On |                  N/A |
|  0%   58C    P5    26W / 250W |    214MiB / 11177MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2587      C   /usr/bin/python                                8MiB |
|    1      1243      G   /usr/lib/xorg/Xorg                           114MiB |
|    1      1446      G   /usr/bin/gnome-shell                          92MiB |
+-----------------------------------------------------------------------------+

Seems like the question you were asking is resolved.
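
For anyone who wants to check detection from a script rather than by reading the table, here is a minimal Python sketch. It assumes nvidia-smi is on the PATH and uses its --query-gpu CSV mode. The helper names (parse_gpu_csv, query_gpus) and the SAMPLE text are made up for illustration: the sample is hand-written from the table above, not captured from the machine, and the exact power formatting is an assumption.

```python
import csv
import io
import shutil
import subprocess

# Fields requested from nvidia-smi; the helper names below are illustrative.
QUERY_FIELDS = ["index", "name", "pci.bus_id", "temperature.gpu", "power.draw"]


def parse_gpu_csv(text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if row:  # skip blank lines
            rows.append(dict(zip(QUERY_FIELDS, (field.strip() for field in row))))
    return rows


def query_gpus():
    """Ask nvidia-smi for the GPUs the driver currently sees."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


# Hand-written sample based on the table above (an assumption, not real output).
SAMPLE = (
    "0, GeForce GTX 1080 Ti, 00000000:17:00.0, 68, 95.00\n"
    "1, GeForce GTX 1080 Ti, 00000000:65:00.0, 58, 26.00\n"
)

if __name__ == "__main__":
    # Fall back to the sample when nvidia-smi is not installed.
    gpus = query_gpus() if shutil.which("nvidia-smi") else parse_gpu_csv(SAMPLE)
    for gpu in gpus:
        print(gpu["index"], gpu["pci.bus_id"], gpu["temperature.gpu"] + "C")
```

If the list has one entry per card reported by lspci, the driver sees all of them.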

kchatzitheodorou · December 14, 2018, 3:34pm

It is finally detected; however, I am not able to run TensorFlow because it fails. With one card it runs without any issues.

Robert_Crovella · December 14, 2018, 3:41pm

Whatever you are running on GPU 0 is code that is doing illegal things:

[ 122.068184] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 1): Out Of Range Address

That is trying to run illegal operations on the GPU. That’s not anything that is a function of your setup. If your TF code is doing that, I’d say your TF code is broken, or perhaps you are running into a display timeout.
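
To see at a glance which GPU is reporting which Xid codes, a small Python sketch like the one below can summarize the dmesg | grep NVRM output. The regular expression is an assumption based only on the line format shown in this thread, and count_xids is a hypothetical helper name, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Matches lines such as:
# [ 122.068184] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception ...
# The pattern is inferred from the log lines in this thread.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def count_xids(dmesg_text):
    """Return a Counter keyed by (pci_address, xid_code)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Three lines copied from the dmesg output earlier in this thread.
SAMPLE = """\
[ 122.068184] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 1): Out Of Range Address
[ 122.068198] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x50ce48=0x24000e 0x50ce50=0x20 0x50ce44=0xd3eff2 0x50ce4c=0x17f
[ 303.843967] NVRM: Xid (PCI:0000:17:00): 38, 000f 0000c197 00000000 00000000 00000000
"""

if __name__ == "__main__":
    for (pci, xid), n in sorted(count_xids(SAMPLE).items()):
        print(f"PCI {pci}  Xid {xid}: {n} line(s)")
```

In the log posted above, every Xid line refers to PCI:0000:17:00, which is GPU 0 in the nvidia-smi table.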

kchatzitheodorou · December 14, 2018, 3:44pm

I use the same code on other machines and it works!

Also, if I unplug the second graphics card and run the same code it works.

Any ideas?

Robert_Crovella · December 14, 2018, 3:48pm

maybe the GPU is defective
maybe your power supply is inadequate
maybe the GPU temperature is getting too high
maybe TF code is behaving differently when it sees 2 GPUs
maybe the motherboard is defective
maybe you are hitting a display timeout
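
One way to narrow this list down without unplugging anything is to run the same job with only one GPU visible at a time. The sketch below assumes the standard CUDA environment variables CUDA_VISIBLE_DEVICES and CUDA_DEVICE_ORDER; env_for_gpu and run_on_gpu are hypothetical helper names, and the command it launches is a stand-in for the real TensorFlow script.

```python
import os
import subprocess
import sys


def env_for_gpu(gpu_index, base_env=None):
    """Copy the environment and expose only one GPU to CUDA programs."""
    env = dict(os.environ if base_env is None else base_env)
    # PCI_BUS_ID ordering makes the index match the one shown by nvidia-smi.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_on_gpu(gpu_index, command):
    """Run a command with a single visible GPU and return its exit code."""
    return subprocess.run(command, env=env_for_gpu(gpu_index)).returncode


if __name__ == "__main__":
    # Stand-in command (an assumption); replace it with the real training script.
    command = [
        sys.executable,
        "-c",
        "import os; print('visible:', os.environ['CUDA_VISIBLE_DEVICES'])",
    ]
    for gpu in (0, 1):
        print("GPU", gpu, "exit code", run_on_gpu(gpu, command))
```

If the job fails only when one particular card is visible, that points toward that card or its slot and power; if it fails only when both are visible, that points toward how the code handles 2 GPUs.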

kchatzitheodorou · December 14, 2018, 3:56pm

I am getting a CUDA_ERROR_LAUNCH_FAILED error from TensorFlow.
