Nvidia-smi failed to detect all GPU cards

I have an NVIDIA GeForce GTX 1080 Ti (GIGABYTE) installed on an Ubuntu 18.04 machine, and I am now trying to add a second, similar card (ASUS).

nvidia-smi does not detect the second card, and sometimes Ubuntu is not able to restart. Here is the nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.18 Driver Version: 415.18 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108… Off | 00000000:17:00.0 On | N/A |
| 0% 57C P2 64W / 250W | 11062MiB / 11177MiB | 1% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1203 G /usr/lib/xorg/Xorg 115MiB |
| 0 1403 G /usr/bin/gnome-shell 76MiB |
+-----------------------------------------------------------------------------+

and here is the lspci | grep -i vga output:
17:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
65:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)

I also tried CUDA 9.0 with NVIDIA-SMI 390, but I was getting out-of-memory errors when using it with TensorFlow. However, when I removed it, TensorFlow ran without issues.

Any ideas? Thanks!

What is the output of:

dmesg | grep NVRM

?

Now I have only the GIGABYTE card plugged in.

The output of dmesg | grep NVRM is:

[ 1.476988] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 415.18 Thu Nov 15 22:01:24 CST 2018 (using threaded interrupts)

Thanks

I’m not able to say what is wrong with only one card plugged in. I was interested in the dmesg output when the system reports this:

Sure, here you go

$ dmesg | grep NVRM
[ 1.621075] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 415.18 Thu Nov 15 22:01:24 CST 2018 (using threaded interrupts)
[ 122.068173] NVRM: GPU at PCI:0000:17:00: GPU-7fdb9608-2313-c9b6-ed03-ebb387f81d1d
[ 122.068179] NVRM: GPU Board Serial Number:
[ 122.068184] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 1): Out Of Range Address
[ 122.068198] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x50ce48=0x24000e 0x50ce50=0x20 0x50ce44=0xd3eff2 0x50ce4c=0x17f
[ 122.068275] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 1): Out Of Range Address
[ 122.068287] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x514e48=0x6000e 0x514e50=0x20 0x514e44=0xd3eff2 0x514e4c=0x17f
[ 122.068358] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 2): Out Of Range Address
[ 122.068369] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x515648=0x31000e 0x515650=0x20 0x515644=0xd3eff2 0x51564c=0x17f
[ 122.068444] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 3): Out Of Range Address
[ 122.068456] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x515e48=0x12d000e 0x515e50=0x20 0x515e44=0xd3eff2 0x515e4c=0x17f
[ 122.068527] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 4): Out Of Range Address
[ 122.068539] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x516648=0x17000e 0x516650=0x20 0x516644=0xd3eff2 0x51664c=0x17f
[ 122.068601] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 0): Out Of Range Address
[ 122.068613] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x51c648=0x31000e 0x51c650=0x20 0x51c644=0xd3eff2 0x51c64c=0x17f
[ 122.068684] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 1): Out Of Range Address
[ 122.068693] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x51ce48=0x109000e 0x51ce50=0x20 0x51ce44=0xd3eff2 0x51ce4c=0x17f
[ 122.068764] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 3, TPC 3): Out Of Range Address
[ 122.068775] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x51de48=0xb000e 0x51de50=0x20 0x51de44=0xd3eff2 0x51de4c=0x17f
[ 122.068834] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 0): Out Of Range Address
[ 122.068843] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x524648=0x138000e 0x524650=0x20 0x524644=0xd3eff2 0x52464c=0x17f
[ 122.068890] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 1): Out Of Range Address
[ 122.068898] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Global Exception on (GPC 4, TPC 1): Physical Multiple Warp Errors
[ 122.068905] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x524e48=0xa000e 0x524e50=0x24 0x524e44=0xd3eff2 0x524e4c=0x17f
[ 122.068952] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 4, TPC 3): Out Of Range Address
[ 122.068960] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x525e48=0x32000e 0x525e50=0x20 0x525e44=0xd3eff2 0x525e4c=0x17f
[ 122.069009] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 2): Out Of Range Address
[ 122.069017] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x52d648=0x13000e 0x52d650=0x20 0x52d644=0xd3eff2 0x52d64c=0x17f
[ 303.843967] NVRM: Xid (PCI:0000:17:00): 38, 000f 0000c197 00000000 00000000 00000000
[ 510.660315] NVRM: Xid (PCI:0000:17:00): 38, 000f 0000c197 00000000 00000000 00000000

$ nvidia-smi
Fri Dec 14 14:54:50 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.18 Driver Version: 415.18 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108… Off | 00000000:17:00.0 Off | N/A |
| 34% 68C P0 95W / 250W | 20MiB / 11178MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108… Off | 00000000:65:00.0 On | N/A |
| 0% 58C P5 26W / 250W | 214MiB / 11177MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2587 C /usr/bin/python 8MiB |
| 1 1243 G /usr/lib/xorg/Xorg 114MiB |
| 1 1446 G /usr/bin/gnome-shell 92MiB |
+-----------------------------------------------------------------------------+

$ lspci | grep -i VGA
17:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
65:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)

I’m confused. This question title reads:

“Nvidia-smi failed to detect all GPU cards”

But your current output shows that both are being detected:

Seems like the question you were asking is resolved.

It is finally detected; however, I am not able to run TensorFlow because it fails. With one card it runs without any issues.

Whatever you are running on GPU 0 is code that is doing illegal things:

[ 122.068184] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 1): Out Of Range Address

That is trying to run illegal operations on the GPU. That’s not anything that is a function of your setup. If your TF code is doing that, I’d say your TF code is broken, or perhaps you are running into a display timeout.
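
To see at a glance which Xid codes each GPU has logged, the NVRM lines can be tallied with a short script. This is only a sketch: the function name summarize_xids and the regular expression are my own, written against the log format shown in this thread, and the sample lines are copied from the dmesg output posted earlier.

```python
import re
from collections import Counter

# Matches lines such as:
# [ 122.068184] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception ...
# Group 1 is the PCI address, group 2 is the Xid code.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:]+)\): (\d+),")


def summarize_xids(dmesg_text):
    """Count Xid events per (PCI address, Xid code) pair."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Three lines copied from the dmesg output in this thread.
sample = """\
[ 122.068184] NVRM: Xid (PCI:0000:17:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 1): Out Of Range Address
[ 122.068198] NVRM: Xid (PCI:0000:17:00): 13, Graphics Exception: ESR 0x50ce48=0x24000e 0x50ce50=0x20 0x50ce44=0xd3eff2 0x50ce4c=0x17f
[ 303.843967] NVRM: Xid (PCI:0000:17:00): 38, 000f 0000c197 00000000 00000000 00000000
"""

for (pci, xid), n in sorted(summarize_xids(sample).items()):
    print(f"{pci} Xid {xid}: {n} event(s)")
```

On a live system the text would come from dmesg itself (which may need root on some setups) rather than from a pasted string. If every event points at the same PCI address, as it does here with 17:00, that at least tells you which card the failing work was running on.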

I use the same code on other machines and it works!

Also, if I unplug the second graphics card and run the same code it works.

Any ideas?

maybe the GPU is defective
maybe your power supply is inadequate
maybe the GPU temperature is getting too high
maybe TF code is behaving differently when it sees 2 GPUs
maybe the motherboard is defective
maybe you are hitting a display timeout
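
Some of these possibilities can be narrowed down from software. The sketch below does two things: it parses the output of nvidia-smi --query-gpu=index,name,temperature.gpu,power.draw --format=csv,noheader,nounits so temperature and power can be watched per card, and it builds an environment with CUDA_VISIBLE_DEVICES set so the same job can be run on one GPU at a time with both cards still installed. The helper names parse_gpu_csv and env_for_gpu are hypothetical, and the sample rows are illustrative values modelled on the nvidia-smi table posted earlier, not real query output.

```python
import os

# Real nvidia-smi query options; run this command to obtain the CSV text.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu,power.draw",
    "--format=csv,noheader,nounits",
]


def _to_float(field):
    """Convert a CSV field to float, or None if the card reports no value."""
    try:
        return float(field)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse the CSV rows produced by QUERY_CMD into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, temp, power = [field.strip() for field in line.split(",")]
        gpus.append({
            "index": int(index),
            "name": name,
            "temp_c": _to_float(temp),
            "power_w": _to_float(power),
        })
    return gpus


def env_for_gpu(index, base_env=None):
    """Return a copy of the environment that exposes only one GPU to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    # Make CUDA number devices in PCI bus order, as nvidia-smi does.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = str(index)
    return env


# Illustrative sample shaped like the query output (values are assumptions).
sample = "0, GeForce GTX 1080 Ti, 68, 95.00\n1, GeForce GTX 1080 Ti, 58, 26.00\n"
for gpu in parse_gpu_csv(sample):
    print(gpu["index"], gpu["name"], gpu["temp_c"], "C", gpu["power_w"], "W")
```

On the machine itself, subprocess.run(QUERY_CMD, capture_output=True, text=True).stdout would supply the text, and subprocess.run(["python", "train.py"], env=env_for_gpu(1)) would run a job (train.py is a placeholder name) on the second card only. If the job fails on GPU 0 alone but runs on GPU 1 alone, the card or its slot becomes the main suspect; if it fails only when both are visible, the multi-GPU code path in TF is worth a closer look.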

I am getting a TensorFlow error: CUDA_ERROR_LAUNCH_FAILED