Ubuntu 16.04 + 2 GTX 1080 Ti: nvidia-smi fails to detect all GPUs

Hi,

I have an Ubuntu 16.04 system with CUDA 8.0 and driver 375.82 installed for two GTX 1080 Ti cards.
But nvidia-smi detects only one card, as follows:
$ nvidia-smi
Fri Aug  4 16:26:19 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.82                 Driver Version: 375.82                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 0000:01:00.0     Off |                  N/A |
|  0%   31C    P0    60W / 360W |      0MiB / 11170MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

$ lspci|grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)

It seems the system can see both GPU cards on the PCI bus, but the driver detects only one. Any ideas? Thanks!
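The comparison above (two cards in lspci, one in nvidia-smi) can be made mechanical. The following is only a minimal sketch, not part of the original post: the function names are made up for illustration, and it assumes lspci prints addresses without a PCI domain prefix (as in the output above) and that the driver-side list is one bus ID per line, for example as produced by `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader`.

```python
import re

# Matches an lspci line such as "01:00.0 VGA compatible controller: ..."
# (assumption: no PCI domain prefix, as in the output shown in this thread).
LSPCI_LINE = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+(.*)$", re.IGNORECASE)


def lspci_nvidia_addresses(lspci_text):
    """Return bus addresses of NVIDIA display controllers found in lspci output."""
    addresses = []
    for line in lspci_text.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if not match:
            continue
        description = match.group(2)
        is_display = "VGA" in description or "3D controller" in description
        if is_display and "NVIDIA Corporation" in description:
            addresses.append(match.group(1).lower())
    return addresses


def smi_addresses(smi_text):
    """Strip the domain from bus IDs like '0000:01:00.0' or '00000000:01:00.0'."""
    addresses = []
    for line in smi_text.splitlines():
        line = line.strip()
        if ":" in line:
            # Drop everything up to the first colon (the PCI domain).
            addresses.append(line.split(":", 1)[1].lower())
    return addresses


def missing_from_driver(lspci_text, smi_text):
    """List GPUs that the PCI bus reports but the driver does not."""
    seen_by_driver = set(smi_addresses(smi_text))
    return [a for a in lspci_nvidia_addresses(lspci_text) if a not in seen_by_driver]
```

Fed with the lspci output above and the single bus ID 0000:01:00.0 from nvidia-smi, `missing_from_driver` would report 02:00.0 as the card the driver failed to bring up.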

What is the output of:

dmesg |grep NVRM

?

Nothing is shown for “dmesg|grep NVRM”, with or without sudo:
$ dmesg|grep NVRM

Besides, “nvidia-smi” sometimes does detect both GPUs, but with one of them marked “ERR!”, as follows:

$ nvidia-smi
Sat Aug  5 11:31:09 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.82                 Driver Version: 375.82                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 0000:01:00.0     Off |                  N/A |
|  0%   34C    P0    61W / 360W |   9135MiB / 11170MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 0000:02:00.0     Off |                  N/A |
|ERR!   33C    P8   ERR! / 360W |     10MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      2033    C   python                                      9133MiB |
|    1      2033    C   python                                         8MiB |
+-----------------------------------------------------------------------------+

That dmesg output doesn’t make sense to me, since the NVIDIA driver writes a message to the system log when it loads, like this:

$ dmesg |grep NVRM
[    6.655273] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  375.66  Mon May  1 15:29:16 PDT 2017 (using threaded interrupts)

Something there doesn’t add up.

Maybe you haven’t removed nouveau properly.

Instead of the 375.82 driver, you may want to try 384.59:

http://www.nvidia.com/download/driverResults.aspx/120917/en-us

Make sure to follow the instructions in the CUDA 8 Linux install guide concerning removal of nouveau:

http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#abstract

(It is probably a good idea to read the whole install guide, if you haven’t.)

I reinstalled CUDA 8.0 without its bundled driver and installed driver 384.59 separately. “nvidia-smi” still detects only one GPU card.

~$ nvidia-smi
Sun Aug  6 14:05:10 2017  
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.59                 Driver Version: 384.59                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   33C    P8    11W / 360W |    126MiB / 11169MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1033    G   /usr/lib/xorg/Xorg                              74MiB |
|    0      1772    G   compiz                                          41MiB |
|    0      2251    G   fcitx-qimpanel                                   8MiB |
+-----------------------------------------------------------------------------+
~$ dmesg|grep NVRM
[    2.081639] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  384.59  Wed Jul 19 23:53:34 PDT 2017 (using threaded interrupts)
[    3.226094] NVRM: Your system is not currently configured to drive a VGA console
[    4.183854] NVRM: GPU at PCI:0000:02:00: GPU-a29e3dde-b03b-5484-803c-c5bf5b3df99c
[    4.183857] NVRM: GPU Board Serial Number: 
[    4.183858] NVRM: Xid (PCI:0000:02:00): 62, 1bad(b2f4) 00000000 00000000
[   62.305789] NVRM: RmInitAdapter failed! (0x26:0xffff:1102)
[   62.305834] NVRM: rm_init_adapter failed for device bearing minor number 1
[   75.165485] NVRM: RmInitAdapter failed! (0x26:0xffff:1102)
[   75.165517] NVRM: rm_init_adapter failed for device bearing minor number 1
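The NVRM lines above carry two kinds of signal: an Xid event tied to a PCI address, and repeated adapter initialization failures tied to a device minor number. As a rough sketch (the function name and the returned dictionary layout are illustrative assumptions, and the patterns are written only against the message formats shown in this log), they can be pulled out of saved dmesg text like this:

```python
import re

# Patterns modelled on the exact messages in the log above; other driver
# versions may word these lines differently.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")
INIT_FAIL_RE = re.compile(
    r"NVRM: rm_init_adapter failed for device bearing minor number (\d+)"
)


def summarize_nvrm(dmesg_text):
    """Collect Xid events and failed device minors from dmesg output."""
    xids = []            # list of (pci_address, xid_number)
    failed_minors = set()
    for line in dmesg_text.splitlines():
        xid = XID_RE.search(line)
        if xid:
            xids.append((xid.group(1), int(xid.group(2))))
            continue
        failure = INIT_FAIL_RE.search(line)
        if failure:
            failed_minors.add(int(failure.group(1)))
    return {"xids": xids, "failed_minors": sorted(failed_minors)}
```

For the log above this yields one Xid 62 event at PCI address 0000:02:00 and a single failing minor, 1, which is the same device that /proc reports at 0000:02:00.0. The meaning of a given Xid number has to be looked up in NVIDIA's Xid documentation; the sketch only extracts it.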

Other information:

~$ cat /proc/driver/nvidia/gpus/0000\:01\:00.0/information 
Model: 		 GeForce GTX 1080 Ti
IRQ:   		 135
GPU UUID: 	 GPU-637bcd08-b214-bd17-8b99-035cfea0b6a7
Video BIOS: 	 86.02.39.00.9c
Bus Type: 	 PCIe
DMA Size: 	 47 bits
DMA Mask: 	 0x7fffffffffff
Bus Location: 	 0000:01:00.0
Device Minor: 	 0

~$ cat /proc/driver/nvidia/gpus/0000\:02\:00.0/information 
Model: 		 GeForce GTX 1080 Ti
IRQ:   		 136
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 47 bits
DMA Mask: 	 0x7fffffffffff
Bus Location: 	 0000:02:00.0
Device Minor: 	 1
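In the two listings above, the card the driver failed to initialize is recognizable by the question marks in its GPU UUID and Video BIOS fields. A small sketch of that check follows; it assumes the "Key: value" layout shown above, and the helper names are hypothetical, not part of any NVIDIA tool.

```python
def parse_information(text):
    """Parse a /proc/driver/nvidia/gpus/<address>/information file into a dict."""
    info = {}
    for line in text.splitlines():
        # Split at the first colon only, so values such as
        # "0000:02:00.0" keep their own colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def looks_uninitialized(info):
    """Heuristic: placeholder '?' characters mean the driver never read the card."""
    return "?" in info.get("GPU UUID", "") or "?" in info.get("Video BIOS", "")
```

Applied to the two files above, this flags 0000:02:00.0 and not 0000:01:00.0. It is only a heuristic based on what this particular driver version prints.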

The loaded nvidia modules:

~$ lsmod|grep nvidia
nvidia_drm             49152  1
nvidia_modeset        843776  5 nvidia_drm
nvidia              13041664  94 nvidia_modeset
drm_kms_helper        155648  2 i915_bpo,nvidia_drm
drm                   364544  5 i915_bpo,drm_kms_helper,nvidia_drm

System:

~$ uname -a
Linux nisp-dmi-02 4.4.0-89-generic #112-Ubuntu SMP Mon Jul 31 19:38:41 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Module information:

~$ modinfo nvidia
filename:       /lib/modules/4.4.0-89-generic/kernel/drivers/video/nvidia.ko
alias:          char-major-195-*
version:        384.59
supported:      external
license:        NVIDIA
srcversion:     31FF0349D3C7B1D9A62B474
alias:          pci:v000010DEd00000E00sv*sd*bc04sc80i00*
alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
depends:        
vermagic:       4.4.0-89-generic SMP mod_unload modversions 
parm:           NVreg_Mobile:int
parm:           NVreg_ResmanDebugLevel:int
parm:           NVreg_RmLogonRC:int
parm:           NVreg_ModifyDeviceFiles:int
parm:           NVreg_DeviceFileUID:int
parm:           NVreg_DeviceFileGID:int
parm:           NVreg_DeviceFileMode:int
parm:           NVreg_UpdateMemoryTypes:int
parm:           NVreg_InitializeSystemMemoryAllocations:int
parm:           NVreg_UsePageAttributeTable:int
parm:           NVreg_MapRegistersEarly:int
parm:           NVreg_RegisterForACPIEvents:int
parm:           NVreg_CheckPCIConfigSpace:int
parm:           NVreg_EnablePCIeGen3:int
parm:           NVreg_EnableMSI:int
parm:           NVreg_TCEBypassMode:int
parm:           NVreg_UseThreadedInterrupts:int
parm:           NVreg_EnableStreamMemOPs:int
parm:           NVreg_MemoryPoolSize:int
parm:           NVreg_RegistryDwords:charp
parm:           NVreg_RegistryDwordsPerDevice:charp
parm:           NVreg_RmMsg:charp
parm:           NVreg_AssignGpus:charp

The nouveau driver is blacklisted and unloaded:

~$ cat /etc/modprobe.d/blacklist-nouveau.conf 
blacklist nouveau
options nouveau modeset=0

Maybe insufficient power is being delivered to the 2nd GPU.

I swapped the power cables of GPU1 and GPU2, and the same issue occurred.
The GPUs are water-cooled, so it is hard for me to swap their positions to test.
So is this a hardware problem rather than a software problem?

I have encountered these errors over the years and always struggled to figure out what is really wrong.
Have I configured the kernel properly, and have I used the right NVIDIA driver? It is very frustrating that there is no tool to help you check the software configuration.

I have three 450’s and a 750 in three different computers.

On the machine with a 450 and a 750, the 450 works great but the 750
gets a video BIOS copy failure and this error:

[ 62.305789] NVRM: RmInitAdapter failed! (0x26:0xffff:1102)

And this when running nvidia-smi:

GPU UUID: GPU-???-???-???-???-???

The 450’s work in any of my three Gentoo computers, but the 750 never works.
I have decided that there is a hardware problem with the 750.
How can you test for hardware problems?